License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier should be applied to
a file was done in a spreadsheet of side by side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
should be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
The criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier                                # files
------------------------------------------------------|-------
GPL-2.0                                                  11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note"; otherwise it was "GPL-2.0". The results were:
SPDX license identifier                                # files
------------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                            930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier                                # files
------------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                            270
GPL-2.0+ WITH Linux-syscall-note                           169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)         21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)         17
LGPL-2.1+ WITH Linux-syscall-note                           15
GPL-1.0+ WITH Linux-syscall-note                            14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)         5
LGPL-2.0+ WITH Linux-syscall-note                            4
LGPL-2.1 WITH Linux-syscall-note                             3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)                   3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)                  1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with the SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 17:07:57 +03:00
/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Written by Mark Hemment, 1996 (markhe@nextd.demon.co.uk).
 *
 * (C) SGI 2006, Christoph Lameter
 *	Cleaned up and restructured to ease the addition of alternative
 *	implementations of SLAB allocators.
 * (C) Linux Foundation 2008-2013
 *	Unified interface for all slab allocators
 */

#ifndef _LINUX_SLAB_H
#define _LINUX_SLAB_H

#include <linux/gfp.h>
#include <linux/overflow.h>
#include <linux/types.h>
#include <linux/workqueue.h>
#include <linux/percpu-refcount.h>


/*
 * Flags to pass to kmem_cache_create().
 * The ones marked DEBUG are only valid if CONFIG_DEBUG_SLAB is set.
 */
/* DEBUG: Perform (expensive) checks on alloc/free */
#define SLAB_CONSISTENCY_CHECKS	((slab_flags_t __force)0x00000100U)
/* DEBUG: Red zone objs in a cache */
#define SLAB_RED_ZONE		((slab_flags_t __force)0x00000400U)
/* DEBUG: Poison objects */
#define SLAB_POISON		((slab_flags_t __force)0x00000800U)
/* Align objs on cache lines */
#define SLAB_HWCACHE_ALIGN	((slab_flags_t __force)0x00002000U)
/* Use GFP_DMA memory */
#define SLAB_CACHE_DMA		((slab_flags_t __force)0x00004000U)
mm: add support for kmem caches in DMA32 zone
Patch series "iommu/io-pgtable-arm-v7s: Use DMA32 zone for page tables",
v6.
This is a followup to the discussion in [1], [2].
IOMMUs using ARMv7 short-descriptor format require page tables (level 1
and 2) to be allocated within the first 4GB of RAM, even on 64-bit
systems.
For L1 tables that are bigger than a page, we can just use
__get_free_pages with GFP_DMA32 (on arm64 systems only, arm would still
use GFP_DMA).
For L2 tables that only take 1KB, it would be a waste to allocate a full
page, so we considered 3 approaches:
1. This series, adding support for GFP_DMA32 slab caches.
2. genalloc, which requires pre-allocating the maximum number of L2 page
tables (4096, so 4MB of memory).
3. page_frag, which is not very memory-efficient as it is unable to reuse
freed fragments until the whole page is freed. [3]
This series is the most memory-efficient approach.
stable@ note:
We confirmed that this is a regression, and IOMMU errors happen on 4.19
and linux-next/master on MT8173 (elm, Acer Chromebook R13). The issue
most likely starts from commit ad67f5a6545f ("arm64: replace ZONE_DMA
with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number of Mediatek
platforms (and maybe others?).
[1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html
[2] https://lists.linuxfoundation.org/pipermail/iommu/2018-December/031696.html
[3] https://patchwork.codeaurora.org/patch/671639/
This patch (of 3):
IOMMUs using ARMv7 short-descriptor format require page tables to be
allocated within the first 4GB of RAM, even on 64-bit systems. On arm64,
this is done by passing GFP_DMA32 flag to memory allocation functions.
For IOMMU L2 tables that only take 1KB, it would be a waste to allocate
a full page using get_free_pages, so we considered 3 approaches:
1. This patch, adding support for GFP_DMA32 slab caches.
2. genalloc, which requires pre-allocating the maximum number of L2
page tables (4096, so 4MB of memory).
3. page_frag, which is not very memory-efficient as it is unable
to reuse freed fragments until the whole page is freed.
This change makes it possible to create a custom cache in DMA32 zone using
kmem_cache_create, then allocate memory using kmem_cache_alloc.
We do not create a DMA32 kmalloc cache array, as there are currently no
users of kmalloc(..., GFP_DMA32). These calls will continue to trigger a
warning, as we keep GFP_DMA32 in GFP_SLAB_BUG_MASK.
This implies that calls to kmem_cache_*alloc on a SLAB_CACHE_DMA32
kmem_cache must _not_ use GFP_DMA32 (it is anyway redundant and
unnecessary).
Link: http://lkml.kernel.org/r/20181210011504.122604-2-drinkcat@chromium.org
Signed-off-by: Nicolas Boichat <drinkcat@chromium.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Sasha Levin <Alexander.Levin@microsoft.com>
Cc: Huaisheng Ye <yehs1@lenovo.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Yong Wu <yong.wu@mediatek.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Tomasz Figa <tfiga@google.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-29 06:43:42 +03:00
/* Use GFP_DMA32 memory */
#define SLAB_CACHE_DMA32	((slab_flags_t __force)0x00008000U)
/* DEBUG: Store the last owner for bug hunting */
#define SLAB_STORE_USER		((slab_flags_t __force)0x00010000U)
/* Panic if kmem_cache_create() fails */
#define SLAB_PANIC		((slab_flags_t __force)0x00040000U)
/*
 * SLAB_TYPESAFE_BY_RCU - **WARNING** READ THIS!
 *
 * This delays freeing the SLAB page by a grace period, it does _NOT_
 * delay object freeing. This means that if you do kmem_cache_free()
 * that memory location is free to be reused at any time. Thus it may
 * be possible to see another object there in the same RCU grace period.
 *
 * This feature only ensures the memory location backing the object
 * stays valid, the trick to using this is relying on an independent
 * object validation pass. Something like:
 *
 *  rcu_read_lock()
 * again:
 *  obj = lockless_lookup(key);
 *  if (obj) {
 *    if (!try_get_ref(obj)) // might fail for free objects
 *      goto again;
 *
 *    if (obj->key != key) { // not the object we expected
 *      put_ref(obj);
 *      goto again;
 *    }
 *  }
 *  rcu_read_unlock();
 *
 * This is useful if we need to approach a kernel structure obliquely,
 * from its address obtained without the usual locking. We can lock
 * the structure to stabilize it and check it's still at the given address,
 * only if we can be sure that the memory has not been meanwhile reused
 * for some other kind of object (which our subsystem's lock might corrupt).
 *
 * rcu_read_lock before reading the address, then rcu_read_unlock after
 * taking the spinlock within the structure expected at that address.
 *
 * Note that SLAB_TYPESAFE_BY_RCU was originally named SLAB_DESTROY_BY_RCU.
 */
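The validation pass described in the comment above can be sketched in user space. This is a single-threaded illustration only, not kernel code: struct obj, slot, try_get_ref(), put_ref() and get_validated() are hypothetical names, and a plain int stands in for the atomic reference count that a real lockless reader would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object in type-stable memory; refs == 0 means "free". */
struct obj {
	int key;
	int refs;
};

/* One slot that may be freed and reused for another key at any time. */
static struct obj slot;

/* Take a reference unless the object is already free. */
static bool try_get_ref(struct obj *o)
{
	if (o->refs == 0)
		return false;
	o->refs++;
	return true;
}

static void put_ref(struct obj *o)
{
	o->refs--;
}

/*
 * Validation pass: pin the object first, then confirm it is still the
 * one that was looked up. Returns NULL when the caller should retry.
 */
static struct obj *get_validated(struct obj *o, int key)
{
	if (!o || !try_get_ref(o))
		return NULL;
	if (o->key != key) {
		put_ref(o);	/* memory was reused for a different object */
		return NULL;
	}
	return o;
}
```

The order matters: the key is compared only after the reference is held, because before that point the slot may still be recycled underneath the reader.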
/* Defer freeing slabs to RCU */
#define SLAB_TYPESAFE_BY_RCU	((slab_flags_t __force)0x00080000U)
/* Spread some memory over cpuset */
#define SLAB_MEM_SPREAD		((slab_flags_t __force)0x00100000U)
/* Trace allocations and frees */
#define SLAB_TRACE		((slab_flags_t __force)0x00200000U)

/* Flag to prevent checks on free */
#ifdef CONFIG_DEBUG_OBJECTS
# define SLAB_DEBUG_OBJECTS	((slab_flags_t __force)0x00400000U)
#else
# define SLAB_DEBUG_OBJECTS	0
#endif

/* Avoid kmemleak tracing */
#define SLAB_NOLEAKTRACE	((slab_flags_t __force)0x00800000U)

/* Fault injection mark */
#ifdef CONFIG_FAILSLAB
# define SLAB_FAILSLAB		((slab_flags_t __force)0x02000000U)
#else
# define SLAB_FAILSLAB		0
#endif
/* Account to memcg */
#ifdef CONFIG_MEMCG_KMEM
# define SLAB_ACCOUNT		((slab_flags_t __force)0x04000000U)
#else
# define SLAB_ACCOUNT		0
#endif
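Several of the flags above collapse to 0 when their config option is disabled. A minimal user-space sketch of why that is convenient follows; the DEMO_* names and demo_flags_t are hypothetical stand-ins, and DEMO_CONFIG_FAILSLAB is assumed to be undefined.

```c
#include <assert.h>

typedef unsigned int demo_flags_t;

#define DEMO_PANIC	((demo_flags_t)0x00040000U)

/* DEMO_CONFIG_FAILSLAB is deliberately left undefined in this sketch. */
#ifdef DEMO_CONFIG_FAILSLAB
# define DEMO_FAILSLAB	((demo_flags_t)0x02000000U)
#else
# define DEMO_FAILSLAB	0
#endif

/* Callers OR the flag in unconditionally; no #ifdef is needed here. */
static demo_flags_t demo_cache_flags(void)
{
	return DEMO_PANIC | DEMO_FAILSLAB;
}
```

With the option disabled the OR is a no-op, so call sites compile unchanged in both configurations.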
kmemcheck: add mm functions
With kmemcheck enabled, the slab allocator needs to do this:
1. Tell kmemcheck to allocate the shadow memory which stores the status of
each byte in the allocation proper, e.g. whether it is initialized or
uninitialized.
2. Tell kmemcheck which parts of memory should be marked uninitialized.
There are actually a few more states, such as "not yet allocated" and
"recently freed".
If a slab cache is set up using the SLAB_NOTRACK flag, it will never return
memory that can take page faults because of kmemcheck.
If a slab cache is NOT set up using the SLAB_NOTRACK flag, callers can still
request memory with the __GFP_NOTRACK flag. This does not prevent the page
faults from occurring, however, but marks the object in question as being
initialized so that no warnings will ever be produced for this object.
In addition to (and in contrast to) __GFP_NOTRACK, the
__GFP_NOTRACK_FALSE_POSITIVE flag indicates that the allocation should
not be tracked _because_ it would produce a false positive. Their values
are identical, but need not be so in the future (for example, we could now
enable/disable false positives with a config option).
Parts of this patch were contributed by Pekka Enberg but merged for
atomicity.
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
[rebased for mainline inclusion]
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
2008-05-31 17:56:17 +04:00

#ifdef CONFIG_KASAN
#define SLAB_KASAN		((slab_flags_t __force)0x08000000U)
#else
#define SLAB_KASAN		0
#endif

/* The following flags affect the page allocator grouping pages by mobility */
/* Objects are reclaimable */
#define SLAB_RECLAIM_ACCOUNT	((slab_flags_t __force)0x00020000U)
#define SLAB_TEMPORARY		SLAB_RECLAIM_ACCOUNT	/* Objects are short-lived */
/*
 * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
 *
 * Dereferencing ZERO_SIZE_PTR will lead to a distinct access fault.
 *
 * ZERO_SIZE_PTR can be passed to kfree though in the same way that NULL can.
 * Both make kfree a no-op.
 */
#define ZERO_SIZE_PTR ((void *)16)

#define ZERO_OR_NULL_PTR(x) ((unsigned long)(x) <= \
				(unsigned long)ZERO_SIZE_PTR)
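The single comparison in ZERO_OR_NULL_PTR() covers both NULL (address 0) and ZERO_SIZE_PTR (address 16). A user-space sketch of the same check is shown below; it uses uintptr_t instead of unsigned long for portability, and is_zero_or_null() is a hypothetical helper added for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ZERO_SIZE_PTR ((void *)16)

/* Both NULL (0) and ZERO_SIZE_PTR (16) compare <= 16; real objects do not. */
#define ZERO_OR_NULL_PTR(x) ((uintptr_t)(x) <= (uintptr_t)ZERO_SIZE_PTR)

/* Hypothetical wrapper so the macro can be exercised as a function. */
static bool is_zero_or_null(const void *p)
{
	return ZERO_OR_NULL_PTR(p);
}
```

A free routine can use this one test to treat both special values as a no-op, as the comment above describes for kfree.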

#include <linux/kasan.h>

struct mem_cgroup;
/*
 * struct kmem_cache related prototypes
 */
void __init kmem_cache_init(void);
bool slab_is_available(void);

extern bool usercopy_fallback;

struct kmem_cache *kmem_cache_create(const char *name, unsigned int size,
			unsigned int align, slab_flags_t flags,
usercopy: Prepare for usercopy whitelisting
This patch prepares the slab allocator to handle caches having annotations
(useroffset and usersize) defining usercopy regions.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on
my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code.
Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass
hardened usercopy checks since these sizes cannot change at runtime.)
To support this whitelist annotation, usercopy region offset and size
members are added to struct kmem_cache. The slab allocator receives a
new function, kmem_cache_create_usercopy(), that creates a new cache
with a usercopy region defined, suitable for declaring spans of fields
within the objects that get copied to/from userspace.
In this patch, the default kmem_cache_create() marks the entire allocation
as whitelisted, leaving it semantically unchanged. Once all fine-grained
whitelists have been added (in subsequent patches), this will be changed
to a usersize of 0, making caches created with kmem_cache_create() not
copyable to/from userspace.
After the entire usercopy whitelist series is applied, less than 15%
of the slab cache memory remains exposed to potential usercopy bugs
after a fresh boot:
Total Slab Memory: 48074720
Usercopyable Memory: 6367532 13.2%
task_struct 0.2% 4480/1630720
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 269760/8740224
dentry 11.1% 585984/5273856
mm_struct 29.1% 54912/188448
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 81920/81920
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 167936/167936
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 455616/455616
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 812032/812032
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1310720/1310720
After some kernel build workloads, the percentage (mainly driven by
dentry and inode caches expanding) drops under 10%:
Total Slab Memory: 95516184
Usercopyable Memory: 8497452 8.8%
task_struct 0.2% 4000/1456000
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 1217280/39439872
dentry 11.1% 1623200/14608800
mm_struct 29.1% 73216/251264
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 94208/94208
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 245760/245760
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 563520/563520
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 794624/794624
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1257472/1257472
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split out a few extra kmalloc hunks]
[kees: add field names to function declarations]
[kees: convert BUGs to WARNs and fail closed]
[kees: add attack surface reduction analysis to commit log]
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christoph Lameter <cl@linux.com>
2017-06-11 05:50:28 +03:00
			void (*ctor)(void *));
struct kmem_cache *kmem_cache_create_usercopy(const char *name,
			unsigned int size, unsigned int align,
			slab_flags_t flags,
			unsigned int useroffset, unsigned int usersize,
			void (*ctor)(void *));
void kmem_cache_destroy(struct kmem_cache *);
int kmem_cache_shrink(struct kmem_cache *);

void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
void memcg_deactivate_kmem_caches(struct mem_cgroup *);

/*
 * Please use this macro to create slab caches. Simply specify the
 * name of the structure and maybe some flags that are listed above.
 *
 * The alignment of the struct determines object alignment. If you
 * f.e. add ____cacheline_aligned_in_smp to the struct declaration
 * then the objects will be properly aligned in SMP configurations.
 */
#define KMEM_CACHE(__struct, __flags) \
		kmem_cache_create(#__struct, sizeof(struct __struct), \
			__alignof__(struct __struct), (__flags), NULL)

/*
 * To whitelist a single field for copying to/from userspace, use this
 * macro instead of KMEM_CACHE() above.
 */
#define KMEM_CACHE_USERCOPY(__struct, __flags, __field) \
		kmem_cache_create_usercopy(#__struct, \
			sizeof(struct __struct), \
			__alignof__(struct __struct), (__flags), \
			offsetof(struct __struct, __field), \
			sizeof_field(struct __struct, __field), NULL)
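To see what offset and size the macro would pass for a given field, the two expressions can be evaluated in userspace. This is a sketch under assumptions: `struct demo_obj`, `struct usercopy_region` and `demo_region()` are hypothetical, and `sizeof_field()` is redefined here as a stand-in for the kernel helper of the same name.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's sizeof_field() helper. */
#define sizeof_field(TYPE, MEMBER) sizeof(((TYPE *)0)->MEMBER)

/* Hypothetical object layout; only `payload` is meant to reach userspace. */
struct demo_obj {
	void *owner;
	unsigned long flags;
	char payload[32];
	int refcount;
};

/* The (offset, size) pair that would describe the whitelisted field. */
struct usercopy_region {
	size_t offset;
	size_t size;
};

static struct usercopy_region demo_region(void)
{
	struct usercopy_region r = {
		.offset = offsetof(struct demo_obj, payload),
		.size = sizeof_field(struct demo_obj, payload),
	};
	return r;
}
```

Because both values are compile-time constants derived from the struct definition, the region follows the field automatically if the struct layout changes.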
2007-05-07 01:49:57 +04:00

2013-01-10 23:00:53 +04:00
/*
 * Common kmalloc functions provided by all allocators
 */
void * __must_check __krealloc(const void *, size_t, gfp_t);
void * __must_check krealloc(const void *, size_t, gfp_t);
void kfree(const void *);
void kzfree(const void *);
2019-07-12 06:54:14 +03:00
size_t __ksize(const void *);
2013-01-10 23:00:53 +04:00
size_t ksize(const void *);

2016-06-07 21:05:33 +03:00
#ifdef CONFIG_HAVE_HARDENED_USERCOPY_ALLOCATOR
2018-01-11 01:48:22 +03:00
void __check_heap_object(const void *ptr, unsigned long n, struct page *page,
			bool to_user);
2016-06-07 21:05:33 +03:00
#else
2018-01-11 01:48:22 +03:00
static inline void __check_heap_object(const void *ptr, unsigned long n,
				       struct page *page, bool to_user) { }
2016-06-07 21:05:33 +03:00
#endif

2013-02-05 20:36:47 +04:00
/*
 * Some archs want to perform DMA into kmalloc caches and need a guaranteed
 * alignment larger than the alignment of a 64-bit integer.
 * Setting ARCH_KMALLOC_MINALIGN in arch headers allows that.
 */
#if defined(ARCH_DMA_MINALIGN) && ARCH_DMA_MINALIGN > 8
#define ARCH_KMALLOC_MINALIGN ARCH_DMA_MINALIGN
#define KMALLOC_MIN_SIZE ARCH_DMA_MINALIGN
#define KMALLOC_SHIFT_LOW ilog2(ARCH_DMA_MINALIGN)
#else
#define ARCH_KMALLOC_MINALIGN __alignof__(unsigned long long)
#endif
slab.h: sprinkle __assume_aligned attributes
The various allocators return aligned memory. Telling the compiler that
allows it to generate better code in many cases, for example when the
return value is immediately passed to memset().
Some code does become larger, but at least we win twice as much as we lose:
$ scripts/bloat-o-meter /tmp/vmlinux vmlinux
add/remove: 0/0 grow/shrink: 13/52 up/down: 995/-2140 (-1145)
An example of the different (and smaller) code can be seen in mm_alloc(). Before:
: 48 8d 78 08 lea 0x8(%rax),%rdi
: 48 89 c1 mov %rax,%rcx
: 48 89 c2 mov %rax,%rdx
: 48 c7 00 00 00 00 00 movq $0x0,(%rax)
: 48 c7 80 48 03 00 00 movq $0x0,0x348(%rax)
: 00 00 00 00
: 31 c0 xor %eax,%eax
: 48 83 e7 f8 and $0xfffffffffffffff8,%rdi
: 48 29 f9 sub %rdi,%rcx
: 81 c1 50 03 00 00 add $0x350,%ecx
: c1 e9 03 shr $0x3,%ecx
: f3 48 ab rep stos %rax,%es:(%rdi)
After:
: 48 89 c2 mov %rax,%rdx
: b9 6a 00 00 00 mov $0x6a,%ecx
: 31 c0 xor %eax,%eax
: 48 89 d7 mov %rdx,%rdi
: f3 48 ab rep stos %rax,%es:(%rdi)
So gcc's strategy is to do two possibly (but not really, of course)
unaligned stores to the first and last word, then do an aligned rep stos
covering the middle part with a little overlap. Maybe arches which do not
allow unaligned stores gain even more.
I don't know if gcc can actually make use of alignments greater than 8 for
anything, so one could probably drop the __assume_xyz_alignment macros and
just use __assume_aligned(8).
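A small userspace sketch of the same idea, using the GCC/Clang `assume_aligned` function attribute on an allocator wrapper. `DEMO_ASSUME_ALIGNED`, `demo_alloc()` and `demo_zalloc()` are hypothetical names chosen for illustration, and whether the compiler actually emits a shorter `memset()` sequence depends on the compiler version and optimization flags.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the kernel's __assume_aligned() macro (GCC and Clang). */
#define DEMO_ASSUME_ALIGNED(a) __attribute__((__assume_aligned__(a)))

/* Hypothetical allocator that promises 8-byte aligned results. */
static void *demo_alloc(size_t size) DEMO_ASSUME_ALIGNED(8);

static void *demo_alloc(size_t size)
{
	/* aligned_alloc() expects the size to be a multiple of the alignment. */
	return aligned_alloc(8, (size + 7) & ~(size_t)7);
}

/*
 * The caller passes the result straight to memset(); with the alignment
 * promise visible, the compiler may skip its unaligned head/tail handling.
 */
static void *demo_zalloc(size_t size)
{
	void *p = demo_alloc(size);

	if (p)
		memset(p, 0, size);
	return p;
}
```

The attribute is a promise, not a check: if the allocator ever returned a less aligned pointer, the behavior of code optimized under that assumption would be undefined.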
The increases in code size are mostly caused by gcc deciding to
opencode strlen() using the check-four-bytes-at-a-time trick when it
knows the buffer is sufficiently aligned (one function grew by 200
bytes). Now it turns out that many of these strlen() calls showing up
were in fact redundant, and they're gone from -next. Applying the two
patches to next-20151001 bloat-o-meter instead says
add/remove: 0/0 grow/shrink: 6/52 up/down: 244/-2140 (-1896)
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-21 02:56:48 +03:00
/*
 * Setting ARCH_SLAB_MINALIGN in arch headers allows a different alignment.
 * Intended for arches that get misalignment faults even for 64 bit integer
 * aligned buffers.
 */
#ifndef ARCH_SLAB_MINALIGN
#define ARCH_SLAB_MINALIGN __alignof__(unsigned long long)
#endif

/*
 * kmalloc and friends return ARCH_KMALLOC_MINALIGN aligned
 * pointers. kmem_cache_alloc and friends return ARCH_SLAB_MINALIGN
 * aligned pointers.
 */
#define __assume_kmalloc_alignment __assume_aligned(ARCH_KMALLOC_MINALIGN)
#define __assume_slab_alignment __assume_aligned(ARCH_SLAB_MINALIGN)
#define __assume_page_alignment __assume_aligned(PAGE_SIZE)

2007-05-17 09:11:01 +04:00
/*
2013-01-10 23:14:19 +04:00
 * Kmalloc array related definitions
 */

#ifdef CONFIG_SLAB
/*
 * The largest kmalloc size supported by the SLAB allocators is
2007-05-17 09:11:01 +04:00
 * 32 megabytes (2^25) or the maximum allocatable page order if that is
 * less than 32 MB.
 *
 * WARNING: It's not easy to increase this value since the allocators have
 * to do various tricks to work around compiler limitations in order to
 * ensure proper constant folding.
 */
2007-06-24 04:16:43 +04:00
#define KMALLOC_SHIFT_HIGH ((MAX_ORDER + PAGE_SHIFT - 1) <= 25 ? \
				(MAX_ORDER + PAGE_SHIFT - 1) : 25)
2013-01-10 23:14:19 +04:00
#define KMALLOC_SHIFT_MAX KMALLOC_SHIFT_HIGH
2013-02-05 20:36:47 +04:00
#ifndef KMALLOC_SHIFT_LOW
2013-01-10 23:14:19 +04:00
#define KMALLOC_SHIFT_LOW 5
2013-02-05 20:36:47 +04:00
#endif
2013-06-14 23:55:13 +04:00
#endif

#ifdef CONFIG_SLUB
2013-01-10 23:14:19 +04:00
/*
2014-01-29 02:24:50 +04:00
 * SLUB directly allocates requests fitting in to an order-1 page
 * (PAGE_SIZE*2). Larger requests are passed to the page allocator.
2013-01-10 23:14:19 +04:00
 */
#define KMALLOC_SHIFT_HIGH (PAGE_SHIFT + 1)
2017-01-11 03:57:27 +03:00
#define KMALLOC_SHIFT_MAX (MAX_ORDER + PAGE_SHIFT - 1)
2013-02-05 20:36:47 +04:00
#ifndef KMALLOC_SHIFT_LOW
2013-01-10 23:14:19 +04:00
#define KMALLOC_SHIFT_LOW 3
#endif
2013-02-05 20:36:47 +04:00
#endif
2007-05-17 09:11:01 +04:00

2013-06-14 23:55:13 +04:00
#ifdef CONFIG_SLOB
/*
2014-01-29 02:24:50 +04:00
 * SLOB passes all requests larger than one page to the page allocator.
2013-06-14 23:55:13 +04:00
 * No kmalloc array is necessary since objects of different sizes can
 * be allocated from the same page.
 */
#define KMALLOC_SHIFT_HIGH PAGE_SHIFT
2017-01-11 03:57:27 +03:00
#define KMALLOC_SHIFT_MAX (MAX_ORDER + PAGE_SHIFT - 1)
2013-06-14 23:55:13 +04:00
#ifndef KMALLOC_SHIFT_LOW
#define KMALLOC_SHIFT_LOW 3
#endif
#endif

2013-01-10 23:14:19 +04:00
/* Maximum allocatable size */
#define KMALLOC_MAX_SIZE (1UL << KMALLOC_SHIFT_MAX)
/* Maximum size for which we actually use a slab cache */
#define KMALLOC_MAX_CACHE_SIZE (1UL << KMALLOC_SHIFT_HIGH)
/* Maximum order allocatable via the slab allocator */
#define KMALLOC_MAX_ORDER (KMALLOC_SHIFT_MAX - PAGE_SHIFT)
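As a worked example of the shift arithmetic above, the following sketch evaluates the SLAB and SLUB definitions for one assumed configuration. `DEMO_PAGE_SHIFT = 12` and `DEMO_MAX_ORDER = 11` are assumptions (4 KiB pages and a common MAX_ORDER value); other architectures and configs give different results, and the helper names are hypothetical.

```c
#include <assert.h>

/* Assumed configuration values, not taken from any particular kernel build. */
enum { DEMO_PAGE_SHIFT = 12, DEMO_MAX_ORDER = 11 };

/* SLAB: capped at 2^25, or at the largest page order if that is smaller. */
static unsigned int slab_shift_high(void)
{
	unsigned int s = DEMO_MAX_ORDER + DEMO_PAGE_SHIFT - 1;

	return s <= 25 ? s : 25;
}

/* SLUB: slab caches serve requests up to an order-1 page (PAGE_SIZE * 2). */
static unsigned int slub_shift_high(void)
{
	return DEMO_PAGE_SHIFT + 1;
}

/* SLUB: larger requests, up to this shift, go to the page allocator. */
static unsigned int slub_shift_max(void)
{
	return DEMO_MAX_ORDER + DEMO_PAGE_SHIFT - 1;
}

/* Mirrors the (1UL << SHIFT) pattern used by the KMALLOC_MAX_* macros. */
static unsigned long size_from_shift(unsigned int shift)
{
	return 1UL << shift;
}
```

Under these assumptions SLAB's largest kmalloc cache is 4 MiB, SLUB's largest kmalloc cache is 8 KiB, and the maximum kmalloc order is 22 - 12 = 10.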
2007-05-17 09:11:01 +04:00

2013-01-10 23:14:19 +04:00
/*
 * Kmalloc subsystem.
 */
2013-02-05 20:36:47 +04:00
#ifndef KMALLOC_MIN_SIZE
2013-01-10 23:14:19 +04:00
#define KMALLOC_MIN_SIZE (1 << KMALLOC_SHIFT_LOW)
2013-01-10 23:14:19 +04:00
#endif

2014-03-12 12:06:19 +04:00
/*
 * This restriction comes from the byte sized index implementation.
 * Page size is normally 2^12 bytes and, in this case, if we want to use
 * a byte sized index which can represent 2^8 entries, the size of the object
 * should be greater than or equal to 2^12 / 2^8 = 2^4 = 16.
 * If the minimum size of kmalloc is less than 16, we use it as the minimum
 * object size and give up on using the byte sized index.
 */
#define SLAB_OBJ_MIN_SIZE (KMALLOC_MIN_SIZE < 16 ? \
				(KMALLOC_MIN_SIZE) : 16)

mm, slab/slub: introduce kmalloc-reclaimable caches
Kmem caches can be created with a SLAB_RECLAIM_ACCOUNT flag, which
indicates they contain objects which can be reclaimed under memory
pressure (typically through a shrinker). This makes the slab pages
accounted as NR_SLAB_RECLAIMABLE in vmstat, which is also reflected in the
MemAvailable meminfo counter and in overcommit decisions. The slab pages
are also allocated with __GFP_RECLAIMABLE, which is good for
anti-fragmentation through grouping pages by mobility.
The generic kmalloc-X caches are created without this flag, but are
sometimes also used for objects that can be reclaimed and that, due to
their varying size, cannot have a dedicated kmem cache with the
SLAB_RECLAIM_ACCOUNT flag. A prominent example is dcache external names,
which prompted the creation
of a new, manually managed vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES
in commit f1782c9bc547 ("dcache: account external names as indirectly
reclaimable memory").
To better handle this and any other similar cases, this patch introduces
SLAB_RECLAIM_ACCOUNT variants of kmalloc caches, named kmalloc-rcl-X.
They are used whenever the kmalloc() call passes __GFP_RECLAIMABLE among
gfp flags. They are added to the kmalloc_caches array as a new type.
Allocations with both __GFP_DMA and __GFP_RECLAIMABLE will use a dma type
cache.
This change only applies to SLAB and SLUB, not SLOB. This is fine, since
SLOB targets tiny systems and this patch does add some overhead of
kmem management objects.
Link: http://lkml.kernel.org/r/20180731090649.16028-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 01:05:38 +03:00
/*
 * Whenever changing this, take care that kmalloc_type() and
 * create_kmalloc_caches() still work as intended.
 */
mm, slab: combine kmalloc_caches and kmalloc_dma_caches
Patch series "kmalloc-reclaimable caches", v4.
As discussed at LSF/MM [1] here's a patchset that introduces
kmalloc-reclaimable caches (more details in the second patch) and uses
them for dcache external names. That allows us to repurpose the
NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series.
With patch 3/6, dcache external names are allocated from kmalloc-rcl-*
caches, eliminating the need for manual accounting. More importantly, it
also ensures the reclaimable kmalloc allocations are grouped in pages
separate from the regular kmalloc allocations. The need for proper
accounting of dcache external names has shown it's easy for a misbehaving
process to allocate lots of them, causing premature OOMs. Without the
added grouping, it's likely that a similar workload can interleave the
dcache external names allocations with regular kmalloc allocations (note:
I haven't searched myself for an example of such regular kmalloc
allocation, but I would be very surprised if there wasn't some). A
pathological case would be e.g. one 64-byte regular allocation with 63
external dcache names in a page (64x64=4096), which means the page is not
freed even after all dcache names are reclaimed, and the process can
thus "steal" the whole page with a single 64-byte allocation.
If other kmalloc users similar to dcache external names become identified,
they can also benefit from the new functionality simply by adding
__GFP_RECLAIMABLE to the kmalloc calls.
Side benefits of the patchset (that could be also merged separately)
include removed branch for detecting __GFP_DMA kmalloc(), and shortening
kmalloc cache names in /proc/slabinfo output. The latter is potentially
an ABI break in case there are tools parsing the names and expecting the
values to be in bytes.
This is what /proc/slabinfo looks like after booting in virtme:
...
kmalloc-rcl-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
...
kmalloc-rcl-96 7 32 128 32 1 : tunables 120 60 8 : slabdata 1 1 0
kmalloc-rcl-64 25 128 64 64 1 : tunables 120 60 8 : slabdata 2 2 0
kmalloc-rcl-32 0 0 32 124 1 : tunables 120 60 8 : slabdata 0 0 0
kmalloc-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
kmalloc-2M 0 0 2097152 1 512 : tunables 1 1 0 : slabdata 0 0 0
kmalloc-1M 0 0 1048576 1 256 : tunables 1 1 0 : slabdata 0 0 0
...
/proc/vmstat with renamed nr_indirectly_reclaimable_bytes counter:
...
nr_slab_reclaimable 2817
nr_slab_unreclaimable 1781
...
nr_kernel_misc_reclaimable 0
...
/proc/meminfo with new KReclaimable counter:
...
Shmem: 564 kB
KReclaimable: 11260 kB
Slab: 18368 kB
SReclaimable: 11260 kB
SUnreclaim: 7108 kB
KernelStack: 1248 kB
...
This patch (of 6):
The kmalloc caches currently maintain a separate (optional) array
kmalloc_dma_caches for __GFP_DMA allocations. There are tests for
__GFP_DMA in the allocation hotpaths. We can avoid the branches by
combining kmalloc_caches and kmalloc_dma_caches into a single
two-dimensional array where the outer dimension is cache "type". This
will also allow adding kmalloc-reclaimable caches as a third type.
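The combined layout can be pictured as a table indexed first by cache type and then by size class. The sketch below is a userspace illustration only: `demo_caches`, `demo_lookup()`, `struct demo_cache` and the constant `DEMO_SHIFT_HIGH = 13` are hypothetical, and the bounds checks are added for the demo rather than taken from the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cache types mirroring the outer array dimension. */
enum { DEMO_NORMAL = 0, DEMO_RECLAIM, DEMO_DMA, DEMO_NR_TYPES };

/* Assumed largest size-class index for this demo. */
#define DEMO_SHIFT_HIGH 13

struct demo_cache {
	const char *name;
};

/* One table for all types: outer index is the type, inner is the size class. */
static struct demo_cache *demo_caches[DEMO_NR_TYPES][DEMO_SHIFT_HIGH + 1];

/* A single indexed load replaces a per-type branch plus a per-type array. */
static struct demo_cache *demo_lookup(unsigned int type, unsigned int index)
{
	if (type >= DEMO_NR_TYPES || index > DEMO_SHIFT_HIGH)
		return NULL;
	return demo_caches[type][index];
}
```

Adding another cache type in this layout means adding one enumerator and one more row, without touching the lookup path.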
Link: http://lkml.kernel.org/r/20180731090649.16028-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 01:05:34 +03:00
enum kmalloc_cache_type {
	KMALLOC_NORMAL = 0,
2018-10-27 01:05:38 +03:00
	KMALLOC_RECLAIM,
2018-10-27 01:05:34 +03:00
#ifdef CONFIG_ZONE_DMA
	KMALLOC_DMA,
#endif
	NR_KMALLOC_TYPES
};

2013-06-14 23:55:13 +04:00
#ifndef CONFIG_SLOB
2018-10-27 01:05:34 +03:00
extern struct kmem_cache *
kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1];

static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags)
{
2013-01-10 23:12:17 +04:00
#ifdef CONFIG_ZONE_DMA
include/linux/slab.h: fix sparse warning in kmalloc_type()
Multiple people have reported the following sparse warning:
./include/linux/slab.h:332:43: warning: dubious: x & !y
The minimal fix would be to change the bitwise & to logical &&, which
emits the same code, but Andrew has suggested that the branch-avoiding
tricks are maybe not worthwhile. David Laight provided a nice comparison
of disassembly of multiple variants, which shows that the current version
produces a 4 deep dependency chain, and fixing the sparse warning by
changing the bitwise AND to multiplication emits an IMUL, making it even
more expensive.
The code as rewritten by this patch yielded the best disassembly, with a
single predictable branch for the most common case, and a ternary operator
for the rest, which gcc seems to compile without a branch or cmov by
itself.
The result should be more readable, without a sparse warning and probably
also faster for the common case.
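The shape described above, one branch for the common case and a ternary for the rest, can be sketched in userspace as follows. The flag values `DEMO_GFP_DMA` and `DEMO_GFP_RECLAIMABLE` are hypothetical and do not match the kernel's real __GFP_* bits, and the ternary is an assumption based on the wording of the commit log.

```c
#include <assert.h>

/* Hypothetical flag bits; the kernel's real __GFP_* values differ. */
#define DEMO_GFP_DMA		0x01u
#define DEMO_GFP_RECLAIMABLE	0x10u

enum demo_cache_type { DEMO_NORMAL = 0, DEMO_RECLAIM, DEMO_DMA };

static enum demo_cache_type demo_kmalloc_type(unsigned int flags)
{
	/* Common case: neither flag is set, decided by one predictable branch. */
	if ((flags & (DEMO_GFP_DMA | DEMO_GFP_RECLAIMABLE)) == 0)
		return DEMO_NORMAL;

	/* At least one flag is set here; DMA takes priority when both are. */
	return (flags & DEMO_GFP_DMA) ? DEMO_DMA : DEMO_RECLAIM;
}
```

Unrelated flag bits fall through the first test unchanged, so only the two flags of interest influence the result.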
Link: http://lkml.kernel.org/r/80340595-d7c5-97b9-4f6c-23fa893a91e9@suse.cz
Fixes: 1291523f2c1d ("mm, slab/slub: introduce kmalloc-reclaimable caches")
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Bart Van Assche <bvanassche@acm.org>
Reported-by: Darryl T. Agostinelli <dagostinelli@gmail.com>
Reported-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Suggested-by: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 11:33:17 +03:00
	/*
	 * The most common case is KMALLOC_NORMAL, so test for it
	 * with a single branch for both flags.
	 */
	if (likely((flags & (__GFP_DMA | __GFP_RECLAIMABLE)) == 0))
		return KMALLOC_NORMAL;
2018-10-27 01:05:38 +03:00

	/*
2018-12-28 11:33:17 +03:00
	 * At least one of the flags has to be set. If both are, __GFP_DMA
	 * is more important.
mm, slab/slub: introduce kmalloc-reclaimable caches
Kmem caches can be created with a SLAB_RECLAIM_ACCOUNT flag, which
indicates they contain objects which can be reclaimed under memory
pressure (typically through a shrinker). This makes the slab pages
accounted as NR_SLAB_RECLAIMABLE in vmstat, which is also reflected in the
MemAvailable meminfo counter and in overcommit decisions. The slab pages
are also allocated with __GFP_RECLAIMABLE, which is good for
anti-fragmentation through grouping pages by mobility.
The generic kmalloc-X caches are created without this flag, but sometimes
are used also for objects that can be reclaimed, which due to varying size
cannot have a dedicated kmem cache with SLAB_RECLAIM_ACCOUNT flag. A
prominent example are dcache external names, which prompted the creation
of a new, manually managed vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES
in commit f1782c9bc547 ("dcache: account external names as indirectly
reclaimable memory").
To better handle this and any other similar cases, this patch introduces
SLAB_RECLAIM_ACCOUNT variants of kmalloc caches, named kmalloc-rcl-X.
They are used whenever the kmalloc() call passes __GFP_RECLAIMABLE among
gfp flags. They are added to the kmalloc_caches array as a new type.
Allocations with both __GFP_DMA and __GFP_RECLAIMABLE will use a dma type
cache.
This change only applies to SLAB and SLUB, not SLOB. This is fine, since
SLOB targets tiny systems and this patch does add some overhead of
kmem management objects.
Link: http://lkml.kernel.org/r/20180731090649.16028-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 01:05:38 +03:00
	 */
2018-12-28 11:33:17 +03:00
	return flags & __GFP_DMA ? KMALLOC_DMA : KMALLOC_RECLAIM;
#else
	return flags & __GFP_RECLAIMABLE ? KMALLOC_RECLAIM : KMALLOC_NORMAL;
#endif
mm, slab: combine kmalloc_caches and kmalloc_dma_caches
Patch series "kmalloc-reclaimable caches", v4.
As discussed at LSF/MM [1] here's a patchset that introduces
kmalloc-reclaimable caches (more details in the second patch) and uses
them for dcache external names. That allows us to repurpose the
NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series.
With patch 3/6, dcache external names are allocated from kmalloc-rcl-*
caches, eliminating the need for manual accounting. More importantly, it
also ensures the reclaimable kmalloc allocations are grouped in pages
separate from the regular kmalloc allocations. The need for proper
accounting of dcache external names has shown it's easy for a misbehaving
process to allocate lots of them, causing premature OOMs. Without the
added grouping, it's likely that a similar workload can interleave the
dcache external names allocations with regular kmalloc allocations (note:
I haven't searched myself for an example of such regular kmalloc
allocation, but I would be very surprised if there wasn't some). A
pathological case would be e.g. one 64-byte regular allocation with 63
external dcache names in a page (64x64=4096), which means the page is not
freed even after reclaiming all dcache names, and the process can
thus "steal" the whole page with a single 64-byte allocation.
If other kmalloc users similar to dcache external names become identified,
they can also benefit from the new functionality simply by adding
__GFP_RECLAIMABLE to the kmalloc calls.
Side benefits of the patchset (that could be also merged separately)
include removed branch for detecting __GFP_DMA kmalloc(), and shortening
kmalloc cache names in /proc/slabinfo output. The latter is potentially
an ABI break in case there are tools parsing the names and expecting the
values to be in bytes.
This is what /proc/slabinfo looks like after booting in virtme:
...
kmalloc-rcl-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
...
kmalloc-rcl-96 7 32 128 32 1 : tunables 120 60 8 : slabdata 1 1 0
kmalloc-rcl-64 25 128 64 64 1 : tunables 120 60 8 : slabdata 2 2 0
kmalloc-rcl-32 0 0 32 124 1 : tunables 120 60 8 : slabdata 0 0 0
kmalloc-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
kmalloc-2M 0 0 2097152 1 512 : tunables 1 1 0 : slabdata 0 0 0
kmalloc-1M 0 0 1048576 1 256 : tunables 1 1 0 : slabdata 0 0 0
...
/proc/vmstat with renamed nr_indirectly_reclaimable_bytes counter:
...
nr_slab_reclaimable 2817
nr_slab_unreclaimable 1781
...
nr_kernel_misc_reclaimable 0
...
/proc/meminfo with new KReclaimable counter:
...
Shmem: 564 kB
KReclaimable: 11260 kB
Slab: 18368 kB
SReclaimable: 11260 kB
SUnreclaim: 7108 kB
KernelStack: 1248 kB
...
This patch (of 6):
The kmalloc caches currently maintain a separate (optional) array
kmalloc_dma_caches for __GFP_DMA allocations. There are tests for
__GFP_DMA in the allocation hotpaths. We can avoid the branches by
combining kmalloc_caches and kmalloc_dma_caches into a single
two-dimensional array where the outer dimension is cache "type". This
will also allow adding kmalloc-reclaimable caches as a third type.
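As a rough user-space illustration of the idea described above (not the
kernel code itself), the sketch below contrasts a lookup over two separate
arrays with a lookup over one two-dimensional array indexed by cache type.
All names, the flag value and the array sizes (the sk_ and SK_ prefixed
identifiers) are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names and sizes, for illustration only. */
enum sk_type { SK_TYPE_NORMAL = 0, SK_TYPE_DMA, SK_NR_TYPES };
#define SK_NR_INDEXES 4
#define SK_FLAG_DMA   0x01u

struct sk_cache { int id; };

/* Before: two separate arrays and an explicit branch on the flag. */
static struct sk_cache sk_normal[SK_NR_INDEXES];
static struct sk_cache sk_dma[SK_NR_INDEXES];

static struct sk_cache *sk_lookup_old(unsigned int flags, unsigned int index)
{
	if (flags & SK_FLAG_DMA)
		return &sk_dma[index];
	return &sk_normal[index];
}

/* After: one two-dimensional array; the outer dimension is the type. */
static struct sk_cache sk_caches[SK_NR_TYPES][SK_NR_INDEXES];

static struct sk_cache *sk_lookup_new(unsigned int flags, unsigned int index)
{
	/* The flag only selects a row number; the array access is uniform. */
	unsigned int type = (flags & SK_FLAG_DMA) ? SK_TYPE_DMA : SK_TYPE_NORMAL;

	assert(index < SK_NR_INDEXES);
	return &sk_caches[type][index];
}
```

With this layout, adding a third type only means adding one enumerator
before SK_NR_TYPES; the lookup itself does not change shape.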
Link: http://lkml.kernel.org/r/20180731090649.16028-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 01:05:34 +03:00
}
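The type selection that ends here can be sketched in plain user-space C as
follows. This is only an approximation under stated assumptions: the flag
bit values and all sketch_ and SKETCH_ names are hypothetical (the real
__GFP_DMA and __GFP_RECLAIMABLE are defined elsewhere in the kernel), and
the sketch covers only the configuration in which a DMA zone exists.

```c
#include <assert.h>

/* Hypothetical flag bits, for illustration only; not the kernel's values. */
#define SKETCH_GFP_DMA         0x01u
#define SKETCH_GFP_RECLAIMABLE 0x10u

enum sketch_cache_type {
	SKETCH_KMALLOC_NORMAL = 0,
	SKETCH_KMALLOC_RECLAIM,
	SKETCH_KMALLOC_DMA,
	SKETCH_NR_TYPES
};

static_assert(SKETCH_NR_TYPES == 3, "sketch assumes three cache types");

static inline enum sketch_cache_type sketch_kmalloc_type(unsigned int flags)
{
	/* Most common case: neither flag is set, one predictable branch. */
	if (!(flags & (SKETCH_GFP_DMA | SKETCH_GFP_RECLAIMABLE)))
		return SKETCH_KMALLOC_NORMAL;

	/* At least one flag is set; DMA takes priority when both are. */
	return (flags & SKETCH_GFP_DMA) ? SKETCH_KMALLOC_DMA
					: SKETCH_KMALLOC_RECLAIM;
}
```

The early return handles the case the commit message calls the most common
one, and the ternary covers the remaining three flag combinations.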
2013-01-10 23:14:19 +04:00
/*
 * Figure out which kmalloc slab an allocation of a certain size
 * belongs to.
 * 0 = zero alloc
 * 1 =  65 .. 96 bytes
2015-06-25 02:55:59 +03:00
 * 2 = 129 .. 192 bytes
 * n = 2^(n-1)+1 .. 2^n
2013-01-10 23:14:19 +04:00
 */
slab: make kmalloc_index() return "unsigned int"
kmalloc_index() returns an index into an array of kmalloc kmem caches,
therefore it should be unsigned.
Space savings with SLUB on trimmed down .config:
add/remove: 0/1 grow/shrink: 6/56 up/down: 85/-557 (-472)
Function old new delta
calculate_sizes 924 983 +59
on_freelist 589 604 +15
init_cache_random_seq 122 127 +5
ext4_mb_init 1206 1210 +4
slab_pad_check.part 270 271 +1
cpu_partial_store 112 113 +1
usersize_show 28 27 -1
...
new_slab 1871 1837 -34
slab_order 204 - -204
This patch starts a series of converting SLUB (mostly) to "unsigned int".
1) Most integers in the code are in fact unsigned entities: array
indexes, lengths, buffer sizes, allocation orders. It is therefore
better to use unsigned variables.
2) Some integers in the code are either "size_t" or "unsigned long" for
no reason.
size_t usually comes from people trying to maintain type correctness
and figuring out that "sizeof" operator returns size_t or
memset/memcpy takes size_t so should everything passed to it.
However the number of 4GB+ objects in the kernel is very small. Most,
if not all, dynamically allocated objects with kmalloc() or
kmem_cache_create() aren't actually big. Maintaining wide types
doesn't do anything.
64-bit ops are bigger than 32-bit on our beloved x86_64,
so try to not use 64-bit where it isn't necessary
(read: everywhere where integers are integers not pointers)
3) in case of SLAB allocators, there are additional limitations
*) page->inuse, page->objects are only 16-/15-bit,
*) cache size was always 32-bit
*) slab orders are small, order 20 is needed to go 64-bit on x86_64
(PAGE_SIZE << order)
Basically everything is 32-bit except kmalloc(1ULL<<32) which gets
shortcut through page allocator.
Christoph said:
:
: That changes with large base page size on power and ARM64 f.e. but then
: we do not want to encourage larger allocations through slab anyways.
Link: http://lkml.kernel.org/r/20180305200730.15812-2-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-06 02:20:22 +03:00
static __always_inline unsigned int kmalloc_index(size_t size)
2013-01-10 23:14:19 +04:00
{
	if (!size)
		return 0;

	if (size <= KMALLOC_MIN_SIZE)
		return KMALLOC_SHIFT_LOW;

	if (KMALLOC_MIN_SIZE <= 32 && size > 64 && size <= 96)
		return 1;
	if (KMALLOC_MIN_SIZE <= 64 && size > 128 && size <= 192)
		return 2;
	if (size <= 8) return 3;
	if (size <= 16) return 4;
	if (size <= 32) return 5;
	if (size <= 64) return 6;
	if (size <= 128) return 7;
	if (size <= 256) return 8;
	if (size <= 512) return 9;
	if (size <= 1024) return 10;
	if (size <= 2 * 1024) return 11;
	if (size <= 4 * 1024) return 12;
	if (size <= 8 * 1024) return 13;
	if (size <= 16 * 1024) return 14;
	if (size <= 32 * 1024) return 15;
	if (size <= 64 * 1024) return 16;
	if (size <= 128 * 1024) return 17;
	if (size <= 256 * 1024) return 18;
	if (size <= 512 * 1024) return 19;
	if (size <= 1024 * 1024) return 20;
	if (size <= 2 * 1024 * 1024) return 21;
	if (size <= 4 * 1024 * 1024) return 22;
	if (size <= 8 * 1024 * 1024) return 23;
	if (size <= 16 * 1024 * 1024) return 24;
	if (size <= 32 * 1024 * 1024) return 25;
	if (size <= 64 * 1024 * 1024) return 26;
	BUG();

	/* Will never be reached. Needed because the compiler may complain */
	return -1;
}
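To make the size-to-index mapping above easier to experiment with, here is
a minimal user-space sketch. It assumes KMALLOC_MIN_SIZE is 8 and
KMALLOC_SHIFT_LOW is 3 (both are configuration dependent in the kernel),
replaces the long if-chain with an equivalent loop, and uses assert() in
place of BUG(). The sk_ and SK_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for this sketch; the kernel derives them from its config. */
#define SK_KMALLOC_MIN_SIZE  8u
#define SK_KMALLOC_SHIFT_LOW 3u
#define SK_KMALLOC_MAX_INDEX 26u	/* 2^26 bytes = 64 MiB */

static unsigned int sk_kmalloc_index(size_t size)
{
	unsigned int n;

	if (!size)
		return 0;

	if (size <= SK_KMALLOC_MIN_SIZE)
		return SK_KMALLOC_SHIFT_LOW;

	/* The two non-power-of-two caches: 96 and 192 bytes. */
	if (SK_KMALLOC_MIN_SIZE <= 32 && size > 64 && size <= 96)
		return 1;
	if (SK_KMALLOC_MIN_SIZE <= 64 && size > 128 && size <= 192)
		return 2;

	/* Otherwise: the smallest n such that size <= 2^n. */
	for (n = SK_KMALLOC_SHIFT_LOW + 1; n <= SK_KMALLOC_MAX_INDEX; n++)
		if (size <= ((size_t)1 << n))
			return n;

	/* Stand-in for BUG(): sizes above 64 MiB have no kmalloc cache. */
	assert(!"size too large for a kmalloc cache");
	return 0;
}
```

The loop form is easier to read but is not what the header uses; the
explicit chain in the header is written so that the compiler can fold the
whole function to a constant when the size is known at compile time.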
2013-06-14 23:55:13 +04:00
#endif /* !CONFIG_SLOB */
2013-01-10 23:14:19 +04:00
2016-05-20 03:10:55 +03:00
void *__kmalloc(size_t size, gfp_t flags) __assume_kmalloc_alignment __malloc;
void *kmem_cache_alloc(struct kmem_cache *, gfp_t flags) __assume_slab_alignment __malloc;
2015-02-13 01:59:32 +03:00
void kmem_cache_free(struct kmem_cache *, void *);
2013-09-04 20:35:34 +04:00
2015-09-05 01:45:34 +03:00
/*
2016-03-16 00:54:03 +03:00
 * Bulk allocation and freeing operations. These are accelerated in an
2015-09-05 01:45:34 +03:00
 * allocator specific way to avoid taking locks repeatedly or building
 * metadata structures unnecessarily.
 *
 * Note that interrupts must be enabled when calling these functions.
 */
void kmem_cache_free_bulk(struct kmem_cache *, size_t, void **);
2015-11-21 02:57:58 +03:00
int kmem_cache_alloc_bulk(struct kmem_cache *, gfp_t, size_t, void **);
2015-09-05 01:45:34 +03:00
2016-03-16 00:54:00 +03:00
/*
 * Caller must not use kfree_bulk() on memory not originally allocated
 * by kmalloc(), because the SLOB allocator cannot handle this.
 */
static __always_inline void kfree_bulk(size_t size, void **p)
{
	kmem_cache_free_bulk(NULL, size, p);
}
2013-09-04 20:35:34 +04:00
#ifdef CONFIG_NUMA
2016-05-20 03:10:55 +03:00
void *__kmalloc_node(size_t size, gfp_t flags, int node) __assume_kmalloc_alignment __malloc;
void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node) __assume_slab_alignment __malloc;
2013-09-04 20:35:34 +04:00
#else
static __always_inline void *__kmalloc_node(size_t size, gfp_t flags, int node)
{
	return __kmalloc(size, flags);
}

static __always_inline void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t flags, int node)
{
	return kmem_cache_alloc(s, flags);
}
#endif

#ifdef CONFIG_TRACING
2016-05-20 03:10:55 +03:00
extern void *kmem_cache_alloc_trace(struct kmem_cache *, gfp_t, size_t) __assume_slab_alignment __malloc;
2013-09-04 20:35:34 +04:00

#ifdef CONFIG_NUMA
extern void *kmem_cache_alloc_node_trace(struct kmem_cache *s,
					 gfp_t gfpflags,
2016-05-20 03:10:55 +03:00
					 int node, size_t size) __assume_slab_alignment __malloc;
2013-09-04 20:35:34 +04:00
#else
static __always_inline void *
kmem_cache_alloc_node_trace(struct kmem_cache *s,
			    gfp_t gfpflags,
			    int node, size_t size)
{
	return kmem_cache_alloc_trace(s, gfpflags, size);
}
#endif /* CONFIG_NUMA */

#else /* CONFIG_TRACING */
static __always_inline void *kmem_cache_alloc_trace(struct kmem_cache *s,
		gfp_t flags, size_t size)
{
2015-02-14 01:39:42 +03:00
	void *ret = kmem_cache_alloc(s, flags);

kasan, mm: change hooks signatures
Patch series "kasan: add software tag-based mode for arm64", v13.
This patchset adds a new software tag-based mode to KASAN [1]. (Initially
this mode was called KHWASAN, but it got renamed, see the naming rationale
at the end of this section).
The plan is to implement HWASan [2] for the kernel with the incentive
that it's going to have performance comparable to KASAN, while at the same
time consuming much less memory, trading that off for somewhat imprecise bug
detection and being supported only on arm64.
The underlying ideas of the approach used by software tag-based KASAN are:
1. By using the Top Byte Ignore (TBI) arm64 CPU feature, we can store
pointer tags in the top byte of each kernel pointer.
2. Using shadow memory, we can store memory tags for each chunk of kernel
memory.
3. On each memory allocation, we can generate a random tag, embed it into
the returned pointer and set the memory tags that correspond to this
chunk of memory to the same value.
4. By using compiler instrumentation, before each memory access we can add
a check that the pointer tag matches the tag of the memory that is being
accessed.
5. On a tag mismatch we report an error.
With this patchset the existing KASAN mode gets renamed to generic KASAN,
with the word "generic" meaning that the implementation can be supported
by any architecture as it is purely software.
The new mode this patchset adds is called software tag-based KASAN. The
word "tag-based" refers to the fact that this mode uses tags embedded into
the top byte of kernel pointers and the TBI arm64 CPU feature that allows
dereferencing such pointers. The word "software" here means that shadow
memory manipulation and tag checking on pointer dereference are done in
software. As it is the only tag-based implementation right now, "software
tag-based" KASAN is sometimes referred to as simply "tag-based" in this
patchset.
A potential expansion of this mode is a hardware tag-based mode, which
would use hardware memory tagging support (announced by Arm [3]) instead
of compiler instrumentation and manual shadow memory manipulation.
Same as generic KASAN, software tag-based KASAN is strictly a debugging
feature.
[1] https://www.kernel.org/doc/html/latest/dev-tools/kasan.html
[2] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html
[3] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a
====== Rationale
On mobile devices generic KASAN's memory usage is a significant problem.
One of the main reasons to have tag-based KASAN is to be able to perform a
similar set of checks as the generic one does, but with lower memory
requirements.
Comment from Vishwath Mohan <vishwath@google.com>:
I don't have data on-hand, but anecdotally both ASAN and KASAN have proven
problematic to enable for environments that don't tolerate the increased
memory pressure well. This includes
(a) Low-memory form factors - Wear, TV, Things, lower-tier phones like Go,
(c) Connected components like Pixel's visual core [1].
These are both places I'd love to have a low(er) memory footprint option at
my disposal.
Comment from Evgenii Stepanov <eugenis@google.com>:
Looking at a live Android device under load, slab (according to
/proc/meminfo) + kernel stack take 8-10% available RAM (~350MB). KASAN's
overhead of 2x - 3x on top of it is not insignificant.
Not having this overhead enables near-production use - ex. running
KASAN/KHWASAN kernel on a personal, daily-use device to catch bugs that do
not reproduce in test configuration. These are the ones that often cost
the most engineering time to track down.
CPU overhead is bad, but generally tolerable. RAM is critical, in our
experience. Once it gets low enough, OOM-killer makes your life
miserable.
[1] https://www.blog.google/products/pixel/pixel-visual-core-image-processing-and-machine-learning-pixel-2/
====== Technical details
Software tag-based KASAN mode is implemented in a very similar way to the
generic one. This patchset essentially does the following:
1. TCR_TBI1 is set to enable Top Byte Ignore.
2. Shadow memory is used (with a different scale, 1:16, so each shadow
byte corresponds to 16 bytes of kernel memory) to store memory tags.
3. All slab objects are aligned to shadow scale, which is 16 bytes.
4. All pointers returned from the slab allocator are tagged with a random
tag and the corresponding shadow memory is poisoned with the same value.
5. Compiler instrumentation is used to insert tag checks, either by
calling callbacks or by inlining them (CONFIG_KASAN_OUTLINE and
CONFIG_KASAN_INLINE flags are reused).
6. When a tag mismatch is detected in callback instrumentation mode
KASAN simply prints a bug report. In case of inline instrumentation,
clang inserts a brk instruction, and KASAN has its own brk handler,
which reports the bug.
7. The memory in between slab objects is marked with a reserved tag, and
acts as a redzone.
8. When a slab object is freed it's marked with a reserved tag.
Bug detection is imprecise for two reasons:
1. We won't catch some small out-of-bounds accesses that fall into the
same shadow cell as the last byte of a slab object.
2. We only have 1 byte to store tags, which means we have a 1/256
probability of a tag match for an incorrect access (actually even
slightly less due to reserved tag values).
Despite that, there's a particular type of bug that tag-based KASAN can
detect and generic KASAN cannot: use-after-free after the object has been
allocated by someone else.
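The tagging scheme described in this section can be modelled in user space
as a rough sketch. It is a toy under stated assumptions, not the KASAN
implementation: addresses are plain 64-bit integers starting at 0 rather
than real kernel pointers, tags are passed in by the caller instead of
being random, the reserved tag value 0xFE is an assumption of this sketch,
and all sk_ and SK_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_GRANULE     16u	/* bytes of memory per shadow byte (1:16) */
#define SK_MEM_SIZE    256u	/* size of the toy "kernel memory" */
#define SK_TAG_SHIFT   56	/* top byte of a 64-bit address */
#define SK_TAG_MASK    ((uint64_t)0xFF << SK_TAG_SHIFT)
#define SK_TAG_INVALID 0xFEu	/* assumed reserved tag for freed memory */

/* One shadow byte (a memory tag) per 16-byte granule, initially 0. */
static uint8_t sk_shadow[SK_MEM_SIZE / SK_GRANULE];

static uint64_t sk_set_tag(uint64_t addr, uint8_t tag)
{
	return (addr & ~SK_TAG_MASK) | ((uint64_t)tag << SK_TAG_SHIFT);
}

static uint8_t sk_get_tag(uint64_t ptr)
{
	return (uint8_t)(ptr >> SK_TAG_SHIFT);
}

static uint64_t sk_untag(uint64_t ptr)
{
	return ptr & ~SK_TAG_MASK;
}

/* "Allocate": write the tag to the shadow and embed it in the pointer. */
static uint64_t sk_alloc(uint64_t addr, size_t size, uint8_t tag)
{
	size_t count = (size + SK_GRANULE - 1) / SK_GRANULE;

	assert(addr % SK_GRANULE == 0 && addr + size <= SK_MEM_SIZE);
	memset(&sk_shadow[addr / SK_GRANULE], tag, count);
	return sk_set_tag(addr, tag);
}

/* "Free": mark the object's granules with the reserved tag. */
static void sk_free(uint64_t ptr, size_t size)
{
	size_t count = (size + SK_GRANULE - 1) / SK_GRANULE;

	memset(&sk_shadow[sk_untag(ptr) / SK_GRANULE], SK_TAG_INVALID, count);
}

/* The check instrumentation would insert before each access: 1 if OK. */
static int sk_access_ok(uint64_t ptr)
{
	uint64_t addr = sk_untag(ptr);

	return addr < SK_MEM_SIZE &&
	       sk_shadow[addr / SK_GRANULE] == sk_get_tag(ptr);
}
```

In this model a 20-byte object at address 32 occupies two granules, so an
access a few bytes past its end still lands in a granule with a matching
tag and goes unnoticed, while an access into the next granule fails the
check. A stale pointer also fails the check after the memory is freed, and
keeps failing after the same memory is handed out again under a new tag.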
====== Testing
Some kernel developers voiced a concern that changing the top byte of
kernel pointers may lead to subtle bugs that are difficult to discover.
To address this concern deliberate testing has been performed.
It doesn't seem feasible to do some kind of static checking to find
potential issues with pointer tagging, so a dynamic approach was taken.
All pointer comparisons/subtractions have been instrumented in an LLVM
compiler pass and a kernel module that would print a bug report whenever
two pointers with different tags are being compared/subtracted (ignoring
comparisons with NULL pointers and with pointers obtained by casting an
error code to a pointer type) has been used. Then the kernel has been
booted in QEMU and on an Odroid C2 board and syzkaller has been run.
This yielded the following results.
The two places that look interesting are:
is_vmalloc_addr in include/linux/mm.h
is_kernel_rodata in mm/util.c
Here we compare a pointer with some fixed untagged values to make sure
that the pointer lies in a particular part of the kernel address space.
Since tag-based KASAN doesn't add tags to pointers that belong to rodata
or vmalloc regions, this should work as is. To make sure, debug checks have
been added to those two functions that verify that the result doesn't change
whether we operate on pointers with or without untagging.
A few other cases that don't look that interesting:
Comparing pointers to achieve unique sorting order of pointee objects
(e.g. sorting lock addresses before performing a double lock):
tty_ldisc_lock_pair_timeout in drivers/tty/tty_ldisc.c
pipe_double_lock in fs/pipe.c
unix_state_double_lock in net/unix/af_unix.c
lock_two_nondirectories in fs/inode.c
mutex_lock_double in kernel/events/core.c
ep_cmp_ffd in fs/eventpoll.c
fsnotify_compare_groups in fs/notify/mark.c
Nothing needs to be done here, since the tags embedded into pointers
don't change, so the sorting order would still be unique.
Checks that a pointer belongs to some particular allocation:
is_sibling_entry in lib/radix-tree.c
object_is_on_stack in include/linux/sched/task_stack.h
Nothing needs to be done here either, since two pointers can only belong
to the same allocation if they have the same tag.
Overall, since the kernel boots and works, there are no critical bugs.
As for the rest, the traditional kernel testing way (use until fails) is
the only one that looks feasible.
Another point here is that tag-based KASAN is available under a separate
config option that needs to be deliberately enabled. Even though it might
be used in a "near-production" environment to find bugs that are not found
during fuzzing or running tests, it is still a debug tool.
====== Benchmarks
The following numbers were collected on Odroid C2 board. Both generic and
tag-based KASAN were used in inline instrumentation mode.
Boot time [1]:
* ~1.7 sec for clean kernel
* ~5.0 sec for generic KASAN
* ~5.0 sec for tag-based KASAN
Network performance [2]:
* 8.33 Gbits/sec for clean kernel
* 3.17 Gbits/sec for generic KASAN
* 2.85 Gbits/sec for tag-based KASAN
Slab memory usage after boot [3]:
* ~40 kb for clean kernel
* ~105 kb (~260% overhead) for generic KASAN
* ~47 kb (~20% overhead) for tag-based KASAN
KASAN memory overhead consists of three main parts:
1. Increased slab memory usage due to redzones.
2. Shadow memory (reserved in whole once during boot).
3. Quarantine (grows gradually until some preset limit; the higher the limit,
the higher the chance to detect a use-after-free).
Comparing tag-based vs generic KASAN for each of these points:
1. 20% vs 260% overhead.
2. 1/16th vs 1/8th of physical memory.
3. Tag-based KASAN doesn't require quarantine.
[1] Time before the ext4 driver is initialized.
[2] Measured as `iperf -s & iperf -c 127.0.0.1 -t 30`.
[3] Measured as `cat /proc/meminfo | grep Slab`.
====== Some notes
A few notes:
1. The patchset can be found here:
https://github.com/xairy/kasan-prototype/tree/khwasan
2. Building requires a recent Clang version (7.0.0 or later).
3. Stack instrumentation is not supported yet and will be added later.
This patch (of 25):
Tag-based KASAN changes the value of the top byte of pointers returned
from the kernel allocation functions (such as kmalloc). This patch
updates KASAN hooks signatures and their usage in SLAB and SLUB code to
reflect that.
Link: http://lkml.kernel.org/r/aec2b5e3973781ff8a6bb6760f8543643202c451.1544099024.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 11:29:37 +03:00
	ret = kasan_kmalloc(s, ret, size, flags);
2015-02-14 01:39:42 +03:00
	return ret;
2013-09-04 20:35:34 +04:00
}

static __always_inline void *
kmem_cache_alloc_node_trace(struct kmem_cache *s,
			    gfp_t gfpflags,
			    int node, size_t size)
{
2015-02-14 01:39:42 +03:00
	void *ret = kmem_cache_alloc_node(s, gfpflags, node);

kasan, mm: change hooks signatures
Patch series "kasan: add software tag-based mode for arm64", v13.
This patchset adds a new software tag-based mode to KASAN [1]. (Initially
this mode was called KHWASAN, but it got renamed, see the naming rationale
at the end of this section).
The plan is to implement HWASan [2] for the kernel with the incentive,
that it's going to have comparable to KASAN performance, but in the same
time consume much less memory, trading that off for somewhat imprecise bug
detection and being supported only for arm64.
The underlying ideas of the approach used by software tag-based KASAN are:
1. By using the Top Byte Ignore (TBI) arm64 CPU feature, we can store
pointer tags in the top byte of each kernel pointer.
2. Using shadow memory, we can store memory tags for each chunk of kernel
memory.
3. On each memory allocation, we can generate a random tag, embed it into
the returned pointer and set the memory tags that correspond to this
chunk of memory to the same value.
4. By using compiler instrumentation, before each memory access we can add
a check that the pointer tag matches the tag of the memory that is being
accessed.
5. On a tag mismatch we report an error.
With this patchset the existing KASAN mode gets renamed to generic KASAN,
with the word "generic" meaning that the implementation can be supported
by any architecture as it is purely software.
The new mode this patchset adds is called software tag-based KASAN. The
word "tag-based" refers to the fact that this mode uses tags embedded into
the top byte of kernel pointers and the TBI arm64 CPU feature that allows
to dereference such pointers. The word "software" here means that shadow
memory manipulation and tag checking on pointer dereference is done in
software. As it is the only tag-based implementation right now, "software
tag-based" KASAN is sometimes referred to as simply "tag-based" in this
patchset.
A potential expansion of this mode is a hardware tag-based mode, which
would use hardware memory tagging support (announced by Arm [3]) instead
of compiler instrumentation and manual shadow memory manipulation.
Same as generic KASAN, software tag-based KASAN is strictly a debugging
feature.
[1] https://www.kernel.org/doc/html/latest/dev-tools/kasan.html
[2] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html
[3] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a
====== Rationale
On mobile devices generic KASAN's memory usage is significant problem.
One of the main reasons to have tag-based KASAN is to be able to perform a
similar set of checks as the generic one does, but with lower memory
requirements.
Comment from Vishwath Mohan <vishwath@google.com>:
I don't have data on-hand, but anecdotally both ASAN and KASAN have proven
problematic to enable for environments that don't tolerate the increased
memory pressure well. This includes
(a) Low-memory form factors - Wear, TV, Things, lower-tier phones like Go,
(c) Connected components like Pixel's visual core [1].
These are both places I'd love to have a low(er) memory footprint option at
my disposal.
Comment from Evgenii Stepanov <eugenis@google.com>:
Looking at a live Android device under load, slab (according to
/proc/meminfo) + kernel stack take 8-10% available RAM (~350MB). KASAN's
overhead of 2x - 3x on top of it is not insignificant.
Not having this overhead enables near-production use - ex. running
KASAN/KHWASAN kernel on a personal, daily-use device to catch bugs that do
not reproduce in test configuration. These are the ones that often cost
the most engineering time to track down.
CPU overhead is bad, but generally tolerable. RAM is critical, in our
experience. Once it gets low enough, OOM-killer makes your life
miserable.
[1] https://www.blog.google/products/pixel/pixel-visual-core-image-processing-and-machine-learning-pixel-2/
====== Technical details
Software tag-based KASAN mode is implemented in a very similar way to the
generic one. This patchset essentially does the following:
1. TCR_TBI1 is set to enable Top Byte Ignore.
2. Shadow memory is used (with a different scale, 1:16, so each shadow
byte corresponds to 16 bytes of kernel memory) to store memory tags.
3. All slab objects are aligned to shadow scale, which is 16 bytes.
4. All pointers returned from the slab allocator are tagged with a random
tag and the corresponding shadow memory is poisoned with the same value.
5. Compiler instrumentation is used to insert tag checks. Either by
calling callbacks or by inlining them (CONFIG_KASAN_OUTLINE and
CONFIG_KASAN_INLINE flags are reused).
6. When a tag mismatch is detected in callback instrumentation mode
KASAN simply prints a bug report. In case of inline instrumentation,
clang inserts a brk instruction, and KASAN has it's own brk handler,
which reports the bug.
7. The memory in between slab objects is marked with a reserved tag, and
acts as a redzone.
8. When a slab object is freed it's marked with a reserved tag.
Bug detection is imprecise for two reasons:
1. We won't catch some small out-of-bounds accesses, that fall into the
same shadow cell, as the last byte of a slab object.
2. We only have 1 byte to store tags, which means we have a 1/256
probability of a tag match for an incorrect access (actually even
slightly less due to reserved tag values).
Despite that, there's a particular type of bug that tag-based KASAN can
detect and generic KASAN cannot: use-after-free after the object has been
allocated by someone else.
====== Testing
Some kernel developers voiced a concern that changing the top byte of
kernel pointers may lead to subtle bugs that are difficult to discover.
To address this concern deliberate testing has been performed.
It doesn't seem feasible to do some kind of static checking to find
potential issues with pointer tagging, so a dynamic approach was taken.
All pointer comparisons/subtractions have been instrumented in an LLVM
compiler pass, and a kernel module has been used that prints a bug report
whenever two pointers with different tags are compared/subtracted
(ignoring comparisons with NULL pointers and with pointers obtained by
casting an error code to a pointer type). The kernel has then been booted
in QEMU and on an Odroid C2 board, and syzkaller has been run.
This yielded the following results.
The two places that look interesting are:
is_vmalloc_addr in include/linux/mm.h
is_kernel_rodata in mm/util.c
Here we compare a pointer with some fixed untagged values to make sure
that the pointer lies in a particular part of the kernel address space.
Since tag-based KASAN doesn't add tags to pointers that belong to rodata
or vmalloc regions, this should work as is. To make sure, debug checks
have been added to those two functions to verify that the result doesn't
change whether we operate on pointers with or without untagging.
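A rough user-space sketch of such a debug check follows. The helper names
(reset_tag, in_range, result_stable) and the example addresses are
assumptions for illustration only; the sketch relies on the fact that an
untagged kernel pointer has 0xFF in its top byte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TAG_SHIFT 56
#define TAG_MASK  (0xFFull << TAG_SHIFT)

/* Restore the native top byte (0xFF) of a kernel pointer. */
static uint64_t reset_tag(uint64_t ptr)
{
	return ptr | TAG_MASK;
}

/* Range test against fixed, untagged bounds: start <= ptr < end. */
static bool in_range(uint64_t ptr, uint64_t start, uint64_t end)
{
	assert(start <= end);
	return ptr >= start && ptr < end;
}

/*
 * Debug check: the answer must be the same for the pointer as given
 * and for its untagged form. A 'false' result flags a tagged pointer
 * for which the plain comparison would give the wrong answer.
 */
static bool result_stable(uint64_t ptr, uint64_t start, uint64_t end)
{
	return in_range(ptr, start, end) ==
	       in_range(reset_tag(ptr), start, end);
}
```

An untagged pointer inside the region and a tagged slab pointer outside it
both pass the check; only a tagged pointer that actually lies inside the
region would trip it, which is the case the text above says cannot happen.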
A few other cases that don't look that interesting:
Comparing pointers to achieve a unique sorting order of pointee objects
(e.g. sorting lock addresses before performing a double lock):
tty_ldisc_lock_pair_timeout in drivers/tty/tty_ldisc.c
pipe_double_lock in fs/pipe.c
unix_state_double_lock in net/unix/af_unix.c
lock_two_nondirectories in fs/inode.c
mutex_lock_double in kernel/events/core.c
ep_cmp_ffd in fs/eventpoll.c
fsnotify_compare_groups in fs/notify/mark.c
Nothing needs to be done here, since the tags embedded into pointers
don't change, so the sorting order would still be unique.
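As a minimal sketch of why this holds, the helper below orders two lock
addresses the way a double-lock routine would. The names with_tag and
lock_order and the example addresses are hypothetical; pointers are
modelled as 64-bit integers.

```c
#include <assert.h>
#include <stdint.h>

#define TAG_SHIFT 56

/* Place 'tag' into the top byte of 'addr'. */
static uint64_t with_tag(uint64_t addr, uint8_t tag)
{
	return (addr & ~(0xFFull << TAG_SHIFT)) | ((uint64_t)tag << TAG_SHIFT);
}

/*
 * Store the two lock addresses in ascending order. A caller would take
 * *first and then *second, whichever order the arguments came in.
 */
static void lock_order(uint64_t a, uint64_t b,
		       uint64_t *first, uint64_t *second)
{
	assert(a != b);
	if (a < b) {
		*first = a;
		*second = b;
	} else {
		*first = b;
		*second = a;
	}
}
```

The resulting order may differ from the order of the untagged addresses,
but since each pointer keeps its tag for the lifetime of the object, every
caller computes the same order, which is all that deadlock avoidance needs.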
Checks that a pointer belongs to some particular allocation:
is_sibling_entry in lib/radix-tree.c
object_is_on_stack in include/linux/sched/task_stack.h
Nothing needs to be done here either, since two pointers can only belong
to the same allocation if they have the same tag.
Overall, since the kernel boots and works, there are no critical bugs.
As for the rest, the traditional kernel testing way (use until fails) is
the only one that looks feasible.
Another point here is that tag-based KASAN is available under a separate
config option that needs to be deliberately enabled. Even though it might
be used in a "near-production" environment to find bugs that are not found
during fuzzing or running tests, it is still a debug tool.
====== Benchmarks
The following numbers were collected on an Odroid C2 board. Both generic
and tag-based KASAN were used in inline instrumentation mode.
Boot time [1]:
* ~1.7 sec for clean kernel
* ~5.0 sec for generic KASAN
* ~5.0 sec for tag-based KASAN
Network performance [2]:
* 8.33 Gbits/sec for clean kernel
* 3.17 Gbits/sec for generic KASAN
* 2.85 Gbits/sec for tag-based KASAN
Slab memory usage after boot [3]:
* ~40 kb for clean kernel
* ~105 kb (~260% overhead) for generic KASAN
* ~47 kb (~20% overhead) for tag-based KASAN
KASAN memory overhead consists of three main parts:
1. Increased slab memory usage due to redzones.
2. Shadow memory (reserved in full once during boot).
3. Quarantine (grows gradually until some preset limit; the higher the
limit, the greater the chance of detecting a use-after-free).
Comparing tag-based vs generic KASAN for each of these points:
1. 20% vs 260% overhead.
2. 1/16th vs 1/8th of physical memory.
3. Tag-based KASAN doesn't require quarantine.
[1] Time before the ext4 driver is initialized.
[2] Measured as `iperf -s & iperf -c 127.0.0.1 -t 30`.
[3] Measured as `cat /proc/meminfo | grep Slab`.
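The shadow memory figures in the comparison above follow directly from the
shadow scale. A small sketch of the arithmetic, where the function name
shadow_bytes and the 2 GiB example size are assumptions for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define GENERIC_SCALE_SHIFT 3	/* generic KASAN: 1 shadow byte per 8 bytes */
#define TAG_SCALE_SHIFT     4	/* tag-based KASAN: 1 shadow byte per 16 bytes */

/* Shadow memory needed to cover 'phys_bytes' of memory at a given scale. */
static uint64_t shadow_bytes(uint64_t phys_bytes, unsigned int scale_shift)
{
	assert(scale_shift < 64);
	return phys_bytes >> scale_shift;
}
```

For a hypothetical machine with 2 GiB of memory this gives 256 MiB of
shadow for generic KASAN and 128 MiB for tag-based KASAN.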
====== Some notes
A few notes:
1. The patchset can be found here:
https://github.com/xairy/kasan-prototype/tree/khwasan
2. Building requires a recent Clang version (7.0.0 or later).
3. Stack instrumentation is not supported yet and will be added later.
This patch (of 25):
Tag-based KASAN changes the value of the top byte of pointers returned
from the kernel allocation functions (such as kmalloc). This patch
updates KASAN hooks signatures and their usage in SLAB and SLUB code to
reflect that.
Link: http://lkml.kernel.org/r/aec2b5e3973781ff8a6bb6760f8543643202c451.1544099024.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 11:29:37 +03:00
|
|
|
ret = kasan_kmalloc(s, ret, size, gfpflags);
|
2015-02-14 01:39:42 +03:00
|
|
|
return ret;
|
2013-09-04 20:35:34 +04:00
|
|
|
}
|
|
|
|
#endif /* CONFIG_TRACING */
|
|
|
|
|
2016-05-20 03:10:55 +03:00
|
|
|
extern void *kmalloc_order(size_t size, gfp_t flags, unsigned int order) __assume_page_alignment __malloc;
|
2013-09-04 20:35:34 +04:00
|
|
|
|
|
|
|
#ifdef CONFIG_TRACING
|
2016-05-20 03:10:55 +03:00
|
|
|
extern void *kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order) __assume_page_alignment __malloc;
|
2013-09-04 20:35:34 +04:00
|
|
|
#else
|
|
|
|
static __always_inline void *
|
|
|
|
kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order)
|
|
|
|
{
|
|
|
|
return kmalloc_order(size, flags, order);
|
|
|
|
}
|
2013-01-10 23:14:19 +04:00
|
|
|
#endif
|
|
|
|
|
2013-09-04 20:35:34 +04:00
|
|
|
static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
|
|
|
|
{
|
|
|
|
unsigned int order = get_order(size);
|
|
|
|
return kmalloc_order_trace(size, flags, order);
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* kmalloc - allocate memory
|
|
|
|
* @size: how many bytes of memory are required.
|
2013-11-23 06:14:38 +04:00
|
|
|
* @flags: the type of memory to allocate.
|
2013-09-04 20:35:34 +04:00
|
|
|
*
|
|
|
|
* kmalloc is the normal method of allocating memory
|
|
|
|
* for objects smaller than page size in the kernel.
|
2013-11-23 06:14:38 +04:00
|
|
|
*
|
2018-11-11 19:48:44 +03:00
|
|
|
* The @flags argument may be one of the GFP flags defined at
|
|
|
|
* include/linux/gfp.h and described at
|
|
|
|
* :ref:`Documentation/core-api/mm-api.rst <mm-api-gfp-flags>`
|
2013-11-23 06:14:38 +04:00
|
|
|
*
|
2018-11-11 19:48:44 +03:00
|
|
|
* The recommended usage of the @flags is described at
|
2018-11-20 19:22:24 +03:00
|
|
|
* :ref:`Documentation/core-api/memory-allocation.rst <memory-allocation>`
|
2013-11-23 06:14:38 +04:00
|
|
|
*
|
2018-11-11 19:48:44 +03:00
|
|
|
* Below is a brief outline of the most useful GFP flags
|
2013-11-23 06:14:38 +04:00
|
|
|
*
|
2018-11-11 19:48:44 +03:00
|
|
|
* %GFP_KERNEL
|
|
|
|
* Allocate normal kernel ram. May sleep.
|
2013-11-23 06:14:38 +04:00
|
|
|
*
|
2018-11-11 19:48:44 +03:00
|
|
|
* %GFP_NOWAIT
|
|
|
|
* Allocation will not sleep.
|
2013-11-23 06:14:38 +04:00
|
|
|
*
|
2018-11-11 19:48:44 +03:00
|
|
|
* %GFP_ATOMIC
|
|
|
|
* Allocation will not sleep. May use emergency pools.
|
2013-11-23 06:14:38 +04:00
|
|
|
*
|
2018-11-11 19:48:44 +03:00
|
|
|
* %GFP_HIGHUSER
|
|
|
|
* Allocate memory from high memory on behalf of user.
|
2013-11-23 06:14:38 +04:00
|
|
|
*
|
|
|
|
* Also it is possible to set different flags by OR'ing
|
|
|
|
* in one or more of the following additional @flags:
|
|
|
|
*
|
2018-11-11 19:48:44 +03:00
|
|
|
* %__GFP_HIGH
|
|
|
|
* This allocation has high priority and may use emergency pools.
|
2013-11-23 06:14:38 +04:00
|
|
|
*
|
2018-11-11 19:48:44 +03:00
|
|
|
* %__GFP_NOFAIL
|
|
|
|
* Indicate that this allocation is in no way allowed to fail
|
|
|
|
* (think twice before using).
|
2013-11-23 06:14:38 +04:00
|
|
|
*
|
2018-11-11 19:48:44 +03:00
|
|
|
* %__GFP_NORETRY
|
|
|
|
* If memory is not immediately available,
|
|
|
|
* then give up at once.
|
2013-11-23 06:14:38 +04:00
|
|
|
*
|
2018-11-11 19:48:44 +03:00
|
|
|
* %__GFP_NOWARN
|
|
|
|
* If allocation fails, don't issue any warnings.
|
2013-11-23 06:14:38 +04:00
|
|
|
*
|
2018-11-11 19:48:44 +03:00
|
|
|
* %__GFP_RETRY_MAYFAIL
|
|
|
|
* Try really hard to succeed the allocation but fail
|
|
|
|
* eventually.
|
2013-09-04 20:35:34 +04:00
|
|
|
*/
|
|
|
|
static __always_inline void *kmalloc(size_t size, gfp_t flags)
|
|
|
|
{
|
|
|
|
if (__builtin_constant_p(size)) {
|
mm, slab: combine kmalloc_caches and kmalloc_dma_caches
Patch series "kmalloc-reclaimable caches", v4.
As discussed at LSF/MM [1] here's a patchset that introduces
kmalloc-reclaimable caches (more details in the second patch) and uses
them for dcache external names. That allows us to repurpose the
NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series.
With patch 3/6, dcache external names are allocated from kmalloc-rcl-*
caches, eliminating the need for manual accounting. More importantly, it
also ensures the reclaimable kmalloc allocations are grouped in pages
separate from the regular kmalloc allocations. The need for proper
accounting of dcache external names has shown it's easy for a misbehaving
process to allocate lots of them, causing premature OOMs. Without the
added grouping, it's likely that a similar workload can interleave the
dcache external names allocations with regular kmalloc allocations (note:
I haven't searched myself for an example of such regular kmalloc
allocation, but I would be very surprised if there wasn't some). A
pathological case would be e.g. one 64-byte regular allocation with 63
external dcache names in a page (64x64=4096), which means the page is not
freed even after reclaiming all the dcache names, and the process can
thus "steal" the whole page with a single 64-byte allocation.
If other kmalloc users similar to dcache external names are identified,
they can also benefit from the new functionality simply by adding
__GFP_RECLAIMABLE to the kmalloc calls.
Side benefits of the patchset (that could be also merged separately)
include removed branch for detecting __GFP_DMA kmalloc(), and shortening
kmalloc cache names in /proc/slabinfo output. The latter is potentially
an ABI break in case there are tools parsing the names and expecting the
values to be in bytes.
This is what /proc/slabinfo looks like after booting in virtme:
...
kmalloc-rcl-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
...
kmalloc-rcl-96 7 32 128 32 1 : tunables 120 60 8 : slabdata 1 1 0
kmalloc-rcl-64 25 128 64 64 1 : tunables 120 60 8 : slabdata 2 2 0
kmalloc-rcl-32 0 0 32 124 1 : tunables 120 60 8 : slabdata 0 0 0
kmalloc-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
kmalloc-2M 0 0 2097152 1 512 : tunables 1 1 0 : slabdata 0 0 0
kmalloc-1M 0 0 1048576 1 256 : tunables 1 1 0 : slabdata 0 0 0
...
/proc/vmstat with renamed nr_indirectly_reclaimable_bytes counter:
...
nr_slab_reclaimable 2817
nr_slab_unreclaimable 1781
...
nr_kernel_misc_reclaimable 0
...
/proc/meminfo with new KReclaimable counter:
...
Shmem: 564 kB
KReclaimable: 11260 kB
Slab: 18368 kB
SReclaimable: 11260 kB
SUnreclaim: 7108 kB
KernelStack: 1248 kB
...
This patch (of 6):
The kmalloc caches currently maintain a separate (optional) array
kmalloc_dma_caches for __GFP_DMA allocations. There are tests for
__GFP_DMA in the allocation hotpaths. We can avoid the branches by
combining kmalloc_caches and kmalloc_dma_caches into a single
two-dimensional array where the outer dimension is cache "type". This
will also allow adding kmalloc-reclaimable caches as a third type.
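The idea can be sketched in user space as below. Everything prefixed with
toy_ or TOY_ is made up for this illustration, including the flag value
and the cache names stored in the table; only the shape of the lookup
(type derived from the flags, then a two-dimensional index) mirrors the
description above.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_GFP_DMA  0x01u	/* stand-in for __GFP_DMA */
#define TOY_NR_SIZES 4

enum toy_cache_type {
	TOY_KMALLOC_NORMAL = 0,
	TOY_KMALLOC_DMA = 1,
	TOY_NR_TYPES
};

/* Outer dimension: cache type. Inner dimension: size index. */
static const char *const toy_caches[TOY_NR_TYPES][TOY_NR_SIZES] = {
	[TOY_KMALLOC_NORMAL] = { NULL, "kmalloc-96", "kmalloc-192", "kmalloc-8" },
	[TOY_KMALLOC_DMA] = { NULL, "dma-kmalloc-96", "dma-kmalloc-192", "dma-kmalloc-8" },
};

/* Map the flags to a type with plain arithmetic, no if/else. */
static enum toy_cache_type toy_kmalloc_type(unsigned int flags)
{
	return (enum toy_cache_type)!!(flags & TOY_GFP_DMA);
}

/* One indexed load replaces "if (flags & DMA) use the other array". */
static const char *toy_lookup(unsigned int flags, unsigned int index)
{
	assert(index < TOY_NR_SIZES);
	return toy_caches[toy_kmalloc_type(flags)][index];
}
```

Adding a third cache type in this model only means adding one more row to
the table and one more value to the enum, which is the extension the text
above anticipates.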
Link: http://lkml.kernel.org/r/20180731090649.16028-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 01:05:34 +03:00
|
|
|
#ifndef CONFIG_SLOB
|
|
|
|
unsigned int index;
|
|
|
|
#endif
|
2013-09-04 20:35:34 +04:00
|
|
|
if (size > KMALLOC_MAX_CACHE_SIZE)
|
|
|
|
return kmalloc_large(size, flags);
|
|
|
|
#ifndef CONFIG_SLOB
|
2018-10-27 01:05:34 +03:00
|
|
|
index = kmalloc_index(size);
|
2013-09-04 20:35:34 +04:00
|
|
|
|
2018-10-27 01:05:34 +03:00
|
|
|
if (!index)
|
|
|
|
return ZERO_SIZE_PTR;
|
2013-09-04 20:35:34 +04:00
|
|
|
|
2018-10-27 01:05:34 +03:00
|
|
|
return kmem_cache_alloc_trace(
|
|
|
|
kmalloc_caches[kmalloc_type(flags)][index],
|
|
|
|
flags, size);
|
2013-09-04 20:35:34 +04:00
|
|
|
#endif
|
|
|
|
}
|
|
|
|
return __kmalloc(size, flags);
|
|
|
|
}
|
|
|
|
|
2013-01-10 23:14:19 +04:00
|
|
|
/*
|
|
|
|
* Determine size used for the nth kmalloc cache.
|
|
|
|
* return size or 0 if a kmalloc cache for that
|
|
|
|
* size does not exist
|
|
|
|
*/
|
2018-04-06 02:20:26 +03:00
|
|
|
static __always_inline unsigned int kmalloc_size(unsigned int n)
|
2013-01-10 23:14:19 +04:00
|
|
|
{
|
2013-06-14 23:55:13 +04:00
|
|
|
#ifndef CONFIG_SLOB
|
2013-01-10 23:14:19 +04:00
|
|
|
if (n > 2)
|
2018-04-06 02:20:26 +03:00
|
|
|
return 1U << n;
|
2013-01-10 23:14:19 +04:00
|
|
|
|
|
|
|
if (n == 1 && KMALLOC_MIN_SIZE <= 32)
|
|
|
|
return 96;
|
|
|
|
|
|
|
|
if (n == 2 && KMALLOC_MIN_SIZE <= 64)
|
|
|
|
return 192;
|
2013-06-14 23:55:13 +04:00
|
|
|
#endif
|
2013-01-10 23:14:19 +04:00
|
|
|
return 0;
|
|
|
|
}
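The index-to-size mapping of kmalloc_size() above can be tried out in user
space with the sketch below. It copies the logic of the non-SLOB branch;
TOY_KMALLOC_MIN_SIZE is an assumed value of 8, whereas the real
KMALLOC_MIN_SIZE depends on the architecture and configuration.

```c
#include <assert.h>

#define TOY_KMALLOC_MIN_SIZE 8u	/* assumed; configuration dependent */

/* Size of the nth kmalloc cache, or 0 if no such cache exists. */
static unsigned int toy_kmalloc_size(unsigned int n)
{
	assert(n < 32);		/* keep the shift below well defined */

	if (n > 2)
		return 1U << n;	/* power-of-two caches: 8, 16, 32, ... */

	if (n == 1 && TOY_KMALLOC_MIN_SIZE <= 32)
		return 96;	/* index 1 is the odd-sized 96-byte cache */

	if (n == 2 && TOY_KMALLOC_MIN_SIZE <= 64)
		return 192;	/* index 2 is the odd-sized 192-byte cache */

	return 0;
}
```

Indexes 1 and 2 are reserved for the two non-power-of-two caches, and
index 0 never maps to a cache.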
|
|
|
|
|
2013-09-04 20:35:34 +04:00
|
|
|
static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
|
|
|
|
{
|
|
|
|
#ifndef CONFIG_SLOB
|
|
|
|
if (__builtin_constant_p(size) &&
|
2018-10-27 01:05:34 +03:00
|
|
|
size <= KMALLOC_MAX_CACHE_SIZE) {
|
slab: make kmalloc_index() return "unsigned int"
kmalloc_index() returns an index into an array of kmalloc kmem caches,
and therefore should be unsigned.
Space savings with SLUB on trimmed down .config:
add/remove: 0/1 grow/shrink: 6/56 up/down: 85/-557 (-472)
Function old new delta
calculate_sizes 924 983 +59
on_freelist 589 604 +15
init_cache_random_seq 122 127 +5
ext4_mb_init 1206 1210 +4
slab_pad_check.part 270 271 +1
cpu_partial_store 112 113 +1
usersize_show 28 27 -1
...
new_slab 1871 1837 -34
slab_order 204 - -204
This patch starts a series converting SLUB (mostly) to "unsigned int".
1) Most integers in the code are in fact unsigned entities: array
indexes, lengths, buffer sizes, allocation orders. It is therefore
better to use unsigned variables
2) Some integers in the code are either "size_t" or "unsigned long" for
no reason.
size_t usually comes from people trying to maintain type correctness
and figuring out that "sizeof" operator returns size_t or
memset/memcpy takes size_t so should everything passed to it.
However the number of 4GB+ objects in the kernel is very small. Most,
if not all, dynamically allocated objects with kmalloc() or
kmem_cache_create() aren't actually big. Maintaining wide types
doesn't do anything.
64-bit ops are bigger than 32-bit on our beloved x86_64,
so try to not use 64-bit where it isn't necessary
(read: everywhere where integers are integers not pointers)
3) in case of SLAB allocators, there are additional limitations
*) page->inuse, page->objects are only 16-/15-bit,
*) cache size was always 32-bit
*) slab orders are small, order 20 is needed to go 64-bit on x86_64
(PAGE_SIZE << order)
Basically everything is 32-bit except kmalloc(1ULL<<32) which gets
shortcut through page allocator.
Christoph said:
:
: That changes with large base page size on power and ARM64 f.e. but then
: we do not want to encourage larger allocations through slab anyways.
Link: http://lkml.kernel.org/r/20180305200730.15812-2-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-06 02:20:22 +03:00
|
|
|
unsigned int i = kmalloc_index(size);
|
2013-09-04 20:35:34 +04:00
|
|
|
|
|
|
|
if (!i)
|
|
|
|
return ZERO_SIZE_PTR;
|
|
|
|
|
mm, slab: combine kmalloc_caches and kmalloc_dma_caches
Patch series "kmalloc-reclaimable caches", v4.
As discussed at LSF/MM [1] here's a patchset that introduces
kmalloc-reclaimable caches (more details in the second patch) and uses
them for dcache external names. That allows us to repurpose the
NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series.
With patch 3/6, dcache external names are allocated from kmalloc-rcl-*
caches, eliminating the need for manual accounting. More importantly, it
also ensures the reclaimable kmalloc allocations are grouped in pages
separate from the regular kmalloc allocations. The need for proper
accounting of dcache external names has shown it's easy for a misbehaving
process to allocate lots of them, causing premature OOMs. Without the
added grouping, it's likely that a similar workload can interleave the
dcache external names allocations with regular kmalloc allocations (note:
I haven't searched myself for an example of such regular kmalloc
allocation, but I would be very surprised if there wasn't some). A
pathological case would be e.g. one 64-byte regular allocation with 63
external dcache names in a page (64x64=4096), which means the page is not
freed even after reclaiming all dcache names, and the process can
thus "steal" the whole page with a single 64-byte allocation.
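The pathological case can be illustrated with a toy model. This is only a sketch, assuming a 4096-byte page holding 64-byte objects and a page that is returned only once no object on it is in use; all `sketch_*` names are hypothetical and do not correspond to kernel structures.

```c
#include <stdbool.h>

#define SKETCH_PAGE_SIZE	4096u
#define SKETCH_OBJ_SIZE		64u
#define SKETCH_OBJS_PER_PAGE	(SKETCH_PAGE_SIZE / SKETCH_OBJ_SIZE)

/* Toy slab page: only tracks how many objects are still allocated. */
struct sketch_page {
	unsigned int inuse;
};

/* A page can go back to the page allocator only when it is empty. */
static bool sketch_page_can_be_freed(const struct sketch_page *pg)
{
	return pg->inuse == 0;
}

/* Release n objects from the page (e.g. reclaimed dcache names). */
static void sketch_free_objects(struct sketch_page *pg, unsigned int n)
{
	pg->inuse -= n;
}
```

Starting from a full page and freeing the 63 reclaimable objects leaves `inuse == 1`, so the page stays pinned by the single regular allocation.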
If other kmalloc users similar to dcache external names become identified,
they can also benefit from the new functionality simply by adding
__GFP_RECLAIMABLE to the kmalloc calls.
Side benefits of the patchset (that could also be merged separately)
include removing the branch for detecting __GFP_DMA kmalloc(), and shortening
kmalloc cache names in /proc/slabinfo output. The latter is potentially
an ABI break in case there are tools parsing the names and expecting the
values to be in bytes.
This is what /proc/slabinfo looks like after booting in virtme:
...
kmalloc-rcl-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
...
kmalloc-rcl-96 7 32 128 32 1 : tunables 120 60 8 : slabdata 1 1 0
kmalloc-rcl-64 25 128 64 64 1 : tunables 120 60 8 : slabdata 2 2 0
kmalloc-rcl-32 0 0 32 124 1 : tunables 120 60 8 : slabdata 0 0 0
kmalloc-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
kmalloc-2M 0 0 2097152 1 512 : tunables 1 1 0 : slabdata 0 0 0
kmalloc-1M 0 0 1048576 1 256 : tunables 1 1 0 : slabdata 0 0 0
...
/proc/vmstat with renamed nr_indirectly_reclaimable_bytes counter:
...
nr_slab_reclaimable 2817
nr_slab_unreclaimable 1781
...
nr_kernel_misc_reclaimable 0
...
/proc/meminfo with new KReclaimable counter:
...
Shmem: 564 kB
KReclaimable: 11260 kB
Slab: 18368 kB
SReclaimable: 11260 kB
SUnreclaim: 7108 kB
KernelStack: 1248 kB
...
This patch (of 6):
The kmalloc caches currently maintain a separate (optional) array,
kmalloc_dma_caches, for __GFP_DMA allocations. There are tests for
__GFP_DMA in the allocation hotpaths. We can avoid the branches by
combining kmalloc_caches and kmalloc_dma_caches into a single
two-dimensional array where the outer dimension is cache "type". This
will also allow adding kmalloc-reclaimable caches as a third type.
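The idea of a single two-dimensional table indexed by cache type can be sketched in user space as follows. This is an illustration under stated assumptions, not the kernel implementation: `SKETCH_GFP_DMA`, `sketch_caches`, `sketch_type()` and `sketch_lookup()` are hypothetical stand-ins, and the flag value and table sizes are arbitrary.

```c
/* Hypothetical stand-in for a DMA allocation flag. */
#define SKETCH_GFP_DMA 0x01u

enum sketch_cache_type {
	SKETCH_NORMAL = 0,
	SKETCH_DMA = 1,
	SKETCH_NR_TYPES
};

#define SKETCH_NR_SIZES 14

struct sketch_cache {
	const char *name;
};

/* One table: outer index is the cache type, inner index is the size class. */
static struct sketch_cache *sketch_caches[SKETCH_NR_TYPES][SKETCH_NR_SIZES];

/* Map flags to a type with arithmetic only, no if/else in the source. */
static enum sketch_cache_type sketch_type(unsigned int flags)
{
	return (enum sketch_cache_type)!!(flags & SKETCH_GFP_DMA);
}

/* The lookup is a plain double index; the flag test became an array index. */
static struct sketch_cache *sketch_lookup(unsigned int flags, unsigned int idx)
{
	return sketch_caches[sketch_type(flags)][idx];
}
```

Adding a third type, such as the reclaimable caches the series introduces, would then only need a new enumerator and a wider mapping in `sketch_type()`; the lookup itself would not change.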
Link: http://lkml.kernel.org/r/20180731090649.16028-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 01:05:34 +03:00
|
|
|
return kmem_cache_alloc_node_trace(
|
|
|
|
kmalloc_caches[kmalloc_type(flags)][i],
|
2013-09-04 20:35:34 +04:00
|
|
|
flags, node, size);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
return __kmalloc_node(size, flags, node);
|
|
|
|
}
|
|
|
|
|
2015-02-13 01:59:20 +03:00
|
|
|
struct memcg_cache_array {
|
|
|
|
struct rcu_head rcu;
|
|
|
|
struct kmem_cache *entries[0];
|
|
|
|
};
|
|
|
|
|
2012-12-19 02:22:27 +04:00
|
|
|
/*
|
|
|
|
* This is the main placeholder for memcg-related information in kmem caches.
|
|
|
|
* Both the root cache and the child caches will have it. For the root cache,
|
|
|
|
* this will hold a dynamically allocated array large enough to hold
|
2014-01-24 03:53:06 +04:00
|
|
|
* information about the currently limited memcgs in the system. To allow the
|
|
|
|
* array to be accessed without taking any locks, on relocation we free the old
|
|
|
|
* version only after a grace period.
|
2012-12-19 02:22:27 +04:00
|
|
|
*
|
2017-02-23 02:41:17 +03:00
|
|
|
* Root and child caches hold different metadata.
|
2012-12-19 02:22:27 +04:00
|
|
|
*
|
2017-02-23 02:41:17 +03:00
|
|
|
* @root_cache: Common to root and child caches. NULL for root, pointer to
|
|
|
|
* the root cache for children.
|
2015-02-13 01:59:23 +03:00
|
|
|
*
|
2017-02-23 02:41:17 +03:00
|
|
|
* The following fields are specific to root caches.
|
|
|
|
*
|
|
|
|
* @memcg_caches: kmemcg ID indexed table of child caches. This table is
|
|
|
|
* used to index child caches during allocation and cleared
|
|
|
|
* early during shutdown.
|
|
|
|
*
|
2017-02-23 02:41:24 +03:00
|
|
|
* @root_caches_node: List node for slab_root_caches list.
|
|
|
|
*
|
2017-02-23 02:41:17 +03:00
|
|
|
* @children: List of all child caches. While the child caches are also
|
|
|
|
* reachable through @memcg_caches, a child cache remains on
|
|
|
|
* this list until it is actually destroyed.
|
|
|
|
*
|
|
|
|
* The following fields are specific to child caches.
|
|
|
|
*
|
|
|
|
* @memcg: Pointer to the memcg this cache belongs to.
|
|
|
|
*
|
|
|
|
* @children_node: List node for @root_cache->children list.
|
2017-02-23 02:41:21 +03:00
|
|
|
*
|
|
|
|
* @kmem_caches_node: List node for @memcg->kmem_caches list.
|
2012-12-19 02:22:27 +04:00
|
|
|
*/
|
|
|
|
struct memcg_cache_params {
|
2017-02-23 02:41:17 +03:00
|
|
|
struct kmem_cache *root_cache;
|
2012-12-19 02:22:27 +04:00
|
|
|
union {
|
2017-02-23 02:41:17 +03:00
|
|
|
struct {
|
|
|
|
struct memcg_cache_array __rcu *memcg_caches;
|
2017-02-23 02:41:24 +03:00
|
|
|
struct list_head __root_caches_node;
|
2017-02-23 02:41:17 +03:00
|
|
|
struct list_head children;
|
2018-06-15 01:26:27 +03:00
|
|
|
bool dying;
|
2017-02-23 02:41:17 +03:00
|
|
|
};
|
2012-12-19 02:22:34 +04:00
|
|
|
struct {
|
|
|
|
struct mem_cgroup *memcg;
|
2017-02-23 02:41:17 +03:00
|
|
|
struct list_head children_node;
|
2017-02-23 02:41:21 +03:00
|
|
|
struct list_head kmem_caches_node;
|
2019-07-12 06:56:27 +03:00
|
|
|
struct percpu_ref refcnt;
|
2017-02-23 02:41:30 +03:00
|
|
|
|
2019-07-12 06:56:06 +03:00
|
|
|
void (*work_fn)(struct kmem_cache *);
|
2017-02-23 02:41:30 +03:00
|
|
|
union {
|
2019-07-12 06:56:06 +03:00
|
|
|
struct rcu_head rcu_head;
|
|
|
|
struct work_struct work;
|
2017-02-23 02:41:30 +03:00
|
|
|
};
|
2012-12-19 02:22:34 +04:00
|
|
|
};
|
2012-12-19 02:22:27 +04:00
|
|
|
};
|
|
|
|
};
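The root/child split documented above (a NULL `root_cache` for roots, a union whose valid member depends on the cache's role) can be sketched with a much smaller structure. This is a simplified illustration only: the `sketch_*` types and the fields `nr_children` and `memcg_id` are hypothetical and do not match the real layout.

```c
#include <stdbool.h>
#include <stddef.h>

struct sketch_cache;

/* Simplified, hypothetical version of the root/child parameter block. */
struct sketch_params {
	struct sketch_cache *root_cache;	/* NULL for a root cache */
	union {
		struct {			/* valid only for root caches */
			unsigned int nr_children;
		} root;
		struct {			/* valid only for child caches */
			int memcg_id;
		} child;
	};
};

struct sketch_cache {
	const char *name;
	struct sketch_params params;
};

/* The common field doubles as the discriminator for the union. */
static bool sketch_is_root(const struct sketch_cache *c)
{
	return c->params.root_cache == NULL;
}

/* Return the root that owns c, which is c itself when c is a root. */
static struct sketch_cache *sketch_root_of(struct sketch_cache *c)
{
	return sketch_is_root(c) ? c : c->params.root_cache;
}
```

Using the shared pointer as the discriminator avoids a separate tag field, at the cost that callers must check it before touching either union member.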
|
|
|
|
|
2012-12-19 02:22:34 +04:00
|
|
|
int memcg_update_all_caches(int num_memcgs);
|
|
|
|
|
2013-06-25 20:16:55 +04:00
|
|
|
/**
|
|
|
|
* kmalloc_array - allocate memory for an array.
|
|
|
|
* @n: number of elements.
|
|
|
|
* @size: element size.
|
|
|
|
* @flags: the type of memory to allocate (see kmalloc).
|
2006-06-23 13:03:48 +04:00
|
|
|
*/
|
2012-03-06 03:14:41 +04:00
|
|
|
static inline void *kmalloc_array(size_t n, size_t size, gfp_t flags)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2018-05-08 22:52:32 +03:00
|
|
|
size_t bytes;
|
|
|
|
|
|
|
|
if (unlikely(check_mul_overflow(n, size, &bytes)))
|
slob: initial NUMA support
This adds preliminary NUMA support to SLOB, primarily aimed at systems with
small nodes (tested all the way down to a 128kB SRAM block), whether
asymmetric or otherwise.
We follow the same conventions as SLAB/SLUB, preferring current node
placement for new pages, or with explicit placement, if a node has been
specified. Presently on UP NUMA this has the side-effect of preferring
node#0 allocations (since numa_node_id() == 0, though this could be
reworked if we could hand off a pfn to determine node placement), so
single-CPU NUMA systems will want to place smaller nodes further out in
terms of node id. Once a page has been bound to a node (via explicit node
id typing), we only do block allocations from partial free pages that have
a matching node id in the page flags.
The current implementation does have some scalability problems, in that all
partial free pages are tracked in the global freelist (with contention due
to the single spinlock). However, these are things that are being reworked
for SMP scalability first, while things like per-node freelists can easily
be built on top of this sort of functionality once it's been added.
More background can be found in:
http://marc.info/?l=linux-mm&m=118117916022379&w=2
http://marc.info/?l=linux-mm&m=118170446306199&w=2
http://marc.info/?l=linux-mm&m=118187859420048&w=2
and subsequent threads.
Acked-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 10:38:22 +04:00
|
|
|
return NULL;
|
2016-07-27 01:22:08 +03:00
|
|
|
if (__builtin_constant_p(n) && __builtin_constant_p(size))
|
2018-05-08 22:52:32 +03:00
|
|
|
return kmalloc(bytes, flags);
|
|
|
|
return __kmalloc(bytes, flags);
|
2012-03-06 03:14:41 +04:00
|
|
|
}
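The overflow check in kmalloc_array() can be tried outside the kernel. The sketch below is a user-space analogue, assuming a GCC or Clang toolchain that provides `__builtin_mul_overflow`; `alloc_array()` is a hypothetical name, and `malloc()` stands in for the kernel allocator.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * User-space analogue of kmalloc_array(): return NULL when n * size would
 * overflow size_t, instead of silently allocating a too-small buffer.
 */
static void *alloc_array(size_t n, size_t size)
{
	size_t bytes;

	/* The builtin stores the product and returns true on overflow. */
	if (__builtin_mul_overflow(n, size, &bytes))
		return NULL;
	return malloc(bytes);
}
```

A call such as `alloc_array(SIZE_MAX, 2)` returns NULL, whereas a plain `malloc(n * size)` would wrap around and allocate far less than the caller expects.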
|
|
|
|
|
|
|
|
/**
|
|
|
|
* kcalloc - allocate memory for an array. The memory is set to zero.
|
|
|
|
* @n: number of elements.
|
|
|
|
* @size: element size.
|
|
|
|
* @flags: the type of memory to allocate (see kmalloc).
|
|
|
|
*/
|
|
|
|
static inline void *kcalloc(size_t n, size_t size, gfp_t flags)
|
|
|
|
{
|
|
|
|
return kmalloc_array(n, size, flags | __GFP_ZERO);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2006-10-04 13:15:25 +04:00
|
|
|
/*
|
|
|
|
* kmalloc_track_caller is a special version of kmalloc that records the
|
|
|
|
* calling function of the routine calling it for slab leak tracking instead
|
|
|
|
* of just the calling function (confusing, eh?).
|
|
|
|
* It's useful when the call to kmalloc comes from a widely-used standard
|
|
|
|
* allocator where we care about the real place the memory allocation
|
|
|
|
* request comes from.
|
|
|
|
*/
|
2008-08-19 21:43:25 +04:00
|
|
|
extern void *__kmalloc_track_caller(size_t, gfp_t, unsigned long);
|
2006-10-04 13:15:25 +04:00
|
|
|
#define kmalloc_track_caller(size, flags) \
|
2008-08-19 21:43:25 +04:00
|
|
|
__kmalloc_track_caller(size, flags, _RET_IP_)
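The caller-tracking pattern described in the comment above can be sketched in user space. This is an illustration assuming GCC or Clang, where `__builtin_return_address(0)` yields the address the current function will return to; `last_caller`, `alloc_track_caller()` and `dup_bytes()` are hypothetical names, and `malloc()` stands in for the kernel allocator.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical record of the call site charged for the last allocation. */
static void *last_caller;

/* Inner allocator: the call site is passed in explicitly. */
static void *alloc_track_caller(size_t size, void *caller)
{
	last_caller = caller;
	return malloc(size);
}

/*
 * Widely used helper. It forwards its own return address, so the
 * allocation is attributed to whoever called dup_bytes() rather than to
 * dup_bytes() itself. noinline keeps the return address meaningful.
 */
__attribute__((noinline))
static char *dup_bytes(const char *src, size_t len)
{
	char *p = alloc_track_caller(len, __builtin_return_address(0));

	if (p)
		memcpy(p, src, len);
	return p;
}
```

Without the forwarded address, every allocation made through the helper would be attributed to the helper, which is the situation the comment calls confusing.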
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2017-11-16 04:32:29 +03:00
|
|
|
static inline void *kmalloc_array_node(size_t n, size_t size, gfp_t flags,
|
|
|
|
int node)
|
|
|
|
{
|
2018-05-08 22:52:32 +03:00
|
|
|
size_t bytes;
|
|
|
|
|
|
|
|
if (unlikely(check_mul_overflow(n, size, &bytes)))
|
2017-11-16 04:32:29 +03:00
|
|
|
return NULL;
|
|
|
|
if (__builtin_constant_p(n) && __builtin_constant_p(size))
|
2018-05-08 22:52:32 +03:00
|
|
|
return kmalloc_node(bytes, flags, node);
|
|
|
|
return __kmalloc_node(bytes, flags, node);
|
2017-11-16 04:32:29 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline void *kcalloc_node(size_t n, size_t size, gfp_t flags, int node)
|
|
|
|
{
|
|
|
|
return kmalloc_array_node(n, size, flags | __GFP_ZERO, node);
|
|
|
|
}
|
|
|
|
|
|
|
|
|
2005-05-01 19:58:38 +04:00
|
|
|
#ifdef CONFIG_NUMA
|
2008-08-19 21:43:25 +04:00
|
|
|
extern void *__kmalloc_node_track_caller(size_t, gfp_t, int, unsigned long);
|
2006-12-07 07:32:30 +03:00
|
|
|
#define kmalloc_node_track_caller(size, flags, node) \
|
|
|
|
__kmalloc_node_track_caller(size, flags, node, \
|
2008-08-19 21:43:25 +04:00
|
|
|
_RET_IP_)
|
2006-12-13 11:34:23 +03:00
|
|
|
|
2006-12-07 07:32:30 +03:00
|
|
|
#else /* CONFIG_NUMA */
|
|
|
|
|
|
|
|
#define kmalloc_node_track_caller(size, flags, node) \
|
|
|
|
kmalloc_track_caller(size, flags)
|
2005-05-01 19:58:38 +04:00
|
|
|
|
2008-11-25 17:08:19 +03:00
|
|
|
#endif /* CONFIG_NUMA */
|
2006-01-08 12:01:45 +03:00
|
|
|
|
2007-07-17 15:03:29 +04:00
|
|
|
/*
|
|
|
|
* Shortcuts
|
|
|
|
*/
|
|
|
|
static inline void *kmem_cache_zalloc(struct kmem_cache *k, gfp_t flags)
|
|
|
|
{
|
|
|
|
return kmem_cache_alloc(k, flags | __GFP_ZERO);
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* kzalloc - allocate memory. The memory is set to zero.
|
|
|
|
* @size: how many bytes of memory are required.
|
|
|
|
* @flags: the type of memory to allocate (see kmalloc).
|
|
|
|
*/
|
|
|
|
static inline void *kzalloc(size_t size, gfp_t flags)
|
|
|
|
{
|
|
|
|
return kmalloc(size, flags | __GFP_ZERO);
|
|
|
|
}
|
|
|
|
|
2008-06-06 09:47:00 +04:00
|
|
|
/**
|
|
|
|
* kzalloc_node - allocate zeroed memory from a particular memory node.
|
|
|
|
* @size: how many bytes of memory are required.
|
|
|
|
* @flags: the type of memory to allocate (see kmalloc).
|
|
|
|
* @node: memory node from which to allocate
|
|
|
|
*/
|
|
|
|
static inline void *kzalloc_node(size_t size, gfp_t flags, int node)
|
|
|
|
{
|
|
|
|
return kmalloc_node(size, flags | __GFP_ZERO, node);
|
|
|
|
}
|
|
|
|
|
2014-10-10 02:26:00 +04:00
|
|
|
unsigned int kmem_cache_size(struct kmem_cache *s);
|
2009-06-12 15:03:06 +04:00
|
|
|
void __init kmem_cache_init_late(void);
|
|
|
|
|
2016-08-23 15:53:19 +03:00
|
|
|
#if defined(CONFIG_SMP) && defined(CONFIG_SLAB)
|
|
|
|
int slab_prepare_cpu(unsigned int cpu);
|
|
|
|
int slab_dead_cpu(unsigned int cpu);
|
|
|
|
#else
|
|
|
|
#define slab_prepare_cpu NULL
|
|
|
|
#define slab_dead_cpu NULL
|
|
|
|
#endif
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
#endif /* _LINUX_SLAB_H */
|