This patch converts the seqiv IV generator to work with the new
AEAD interface where IV generators are just normal AEAD algorithms.
Full backwards compatibility is paramount at this point since
no users have yet switched over to the new interface. Nor can
they switch to the new interface until IV generation is fully
supported by it.
So this means we are adding two versions of seqiv alongside the
existing one. The first one is the one that will be used when
the underlying AEAD algorithm has switched over to the new AEAD
interface. The second one handles the current case where the
underlying AEAD algorithm still uses the old interface.
Both versions export themselves through the new AEAD interface.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch adds the basic structure of the new AEAD type. Unlike
the current version, there is no longer any concept of geniv. IV
generation will still be carried out by wrappers but they will be
normal AEAD algorithms that simply take the IPsec sequence number
as the IV.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch adds the helper crypto_aead_maxauthsize to remove the
need to directly dereference aead_alg internals by AEAD implementors.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch is the first step in the introduction of a new AEAD
alg type. Unlike normal conversions this patch only renames the
existing aead_alg structure because there are external references
to it.
Those references will be removed after this patch.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The primary user of AEAD, IPsec includes the IV in the AD in
most cases, except where it is implicitly authenticated by the
underlying algorithm.
The way it is currently implemented is a hack because we pass
the data in piecemeal and the underlying algorithms try to stitch
them back up into one piece.
This is why this patch is adding a new interface that allows a
single SG list to be passed in that contains everything so the
algorithm implementors do not have to stitch.
The new interface accepts a single source SG list and a single
destination SG list. Both must be laid out as follows:
AD, skipped data, plain/cipher text, ICV
The ICV is not present from the source during encryption and from
the destination during decryption.
For the top-level IPsec AEAD algorithm the plain/cipher text will
contain the generated (or received) IV.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch adds the scatterwalk_ffwd helper which can create an
SG list that starts in the middle of an existing SG list. The
new list may either be part of the existing list or be a chain
that latches onto part of the existing list.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch simply adds the MD5 IV in the md5 header.
Signed-off-by: LABBE Corentin <clabbe.montjoie@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch converts the top-level aead interface to the new style.
All user-level AEAD interface code have been moved into crypto/aead.h.
The allocation/free functions have switched over to the new way of
allocating tfms.
This patch also removes the double indrection on setkey so the
indirection now exists only at the alg level.
Apart from these there are no user-visible changes.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch adds the helper crypto_aead_set_reqsize so that people
don't have to directly access the aead internals to set the reqsize.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch adds a new primitive crypto_grab_spawn which is meant
to replace crypto_init_spawn and crypto_init_spawn2. Under the
new scheme the user no longer has to worry about reference counting
the alg object before it is subsumed by the spawn.
It is pretty much an exact copy of crypto_grab_aead.
Prior to calling this function spawn->frontend and spawn->inst
must have been set.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add driver for NX-842 hardware on the PowerNV platform.
This allows the use of the 842 compression hardware coprocessor on
the PowerNV platform.
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add "constraints" for the NX-842 driver. The constraints are used to
indicate what the current NX-842 platform driver is capable of. The
constraints tell the NX-842 user what alignment, min and max length, and
length multiple each provided buffers should conform to. These are
required because the 842 hardware requires buffers to meet specific
constraints that vary based on platform - for example, the pSeries
max length is much lower than the PowerNV max length.
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add NX-842 frontend that allows using either the pSeries platform or
PowerNV platform driver (to be added by later patch) for the NX-842
hardware. Update the MAINTAINERS file to include the new filenames.
Update Kconfig files to clarify titles and descriptions, and correct
dependencies.
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add 842-format software compression and decompression functions.
Update the MAINTAINERS 842 section to include the new files.
The 842 compression function can compress any input data into the 842
compression format. The 842 decompression function can decompress any
standard-format 842 compressed data - specifically, either a compressed
data buffer created by the 842 software compression function, or a
compressed data buffer created by the 842 hardware compressor (located
in PowerPC coprocessors).
The 842 compressed data format is explained in the header comments.
This is used in a later patch to provide a full software 842 compression
and decompression crypto interface.
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
In testmgr, struct pcomp_testvec takes a non-const 'params' field, which is
pointed to a const deflate_comp_params or deflate_decomp_params object. With
gcc-5 this incurs the following warnings:
In file included from ../crypto/testmgr.c:44:0:
../crypto/testmgr.h:28736:13: warning: initialization discards 'const' qualifier from pointer target type [-Wdiscarded-array-qualifiers]
.params = &deflate_comp_params,
^
../crypto/testmgr.h:28748:13: warning: initialization discards 'const' qualifier from pointer target type [-Wdiscarded-array-qualifiers]
.params = &deflate_comp_params,
^
../crypto/testmgr.h:28776:13: warning: initialization discards 'const' qualifier from pointer target type [-Wdiscarded-array-qualifiers]
.params = &deflate_decomp_params,
^
../crypto/testmgr.h:28800:13: warning: initialization discards 'const' qualifier from pointer target type [-Wdiscarded-array-qualifiers]
.params = &deflate_decomp_params,
^
Fix this by making the parameters pointer const and constifying the things
that use it.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Currently we're hiding mod->sig_ok under an ifdef in open code.
This patch adds a module_sig_ok accessor function and removes that
ifdef.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Now that all rng implementations have switched over to the new
interface, we can remove the old low-level interface.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch converts the DRBG implementation to the new low-level
rng interface.
This allows us to get rid of struct drbg_gen by using the new RNG
API instead.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Stephan Mueller <smueller@chronox.de>
This patch adds the helpers that allow the registration and removal
of multiple RNG algorithms.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch adds the function crypto_rng_set_entropy. It is only
meant to be used by testmgr when testing RNG implementations by
providing fixed entropy data in order to verify test vectors.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch converts the low-level crypto_rng interface to the
"new" style.
This allows existing implementations to be converted over one-
by-one. Once that is complete we can then remove the old rng
interface.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
There is no reason why crypto_rng_reset should modify the seed
so this patch marks it as const. Since our algorithms don't
export a const seed function yet we have to go through some
contortions for now.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch adds the new top-level function crypto_rng_generate
which generates random numbers with additional input. It also
extends the mid-level rng_gen_random function to take additional
data as input.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch converts the top-level crypto_rng to the "new" style.
It was the last algorithm type added before we switched over
to the new way of doing things exemplified by shash.
All users will automatically switch over to the new interface.
Note that this patch does not touch the low-level interface to
rng implementations.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The creation of a shadow copy is intended to only hold a short term
lock. But the drawback is that parallel users have a very similar DRBG
state which only differs by a high-resolution time stamp.
The DRBG will now hold a long term lock. Therefore, the lock is changed
to a mutex which implies that the DRBG can only be used in process
context.
The lock now guards the instantiation as well as the entire DRBG
generation operation. Therefore, multiple callers are fully serialized
when generating a random number.
As the locking is changed to use a long-term lock to avoid such similar
DRBG states, the entire creation and maintenance of a shadow copy can be
removed.
Signed-off-by: Stephan Mueller <smueller@chronox.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Merge second patchbomb from Andrew Morton:
- the rest of MM
- various misc bits
- add ability to run /sbin/reboot at reboot time
- printk/vsprintf changes
- fiddle with seq_printf() return value
* akpm: (114 commits)
parisc: remove use of seq_printf return value
lru_cache: remove use of seq_printf return value
tracing: remove use of seq_printf return value
cgroup: remove use of seq_printf return value
proc: remove use of seq_printf return value
s390: remove use of seq_printf return value
cris fasttimer: remove use of seq_printf return value
cris: remove use of seq_printf return value
openrisc: remove use of seq_printf return value
ARM: plat-pxa: remove use of seq_printf return value
nios2: cpuinfo: remove use of seq_printf return value
microblaze: mb: remove use of seq_printf return value
ipc: remove use of seq_printf return value
rtc: remove use of seq_printf return value
power: wakeup: remove use of seq_printf return value
x86: mtrr: if: remove use of seq_printf return value
linux/bitmap.h: improve BITMAP_{LAST,FIRST}_WORD_MASK
MAINTAINERS: CREDITS: remove Stefano Brivio from B43
.mailmap: add Ricardo Ribalda
CREDITS: add Ricardo Ribalda Delgado
...
The macro BITMAP_LAST_WORD_MASK can be implemented without a conditional,
which will generally lead to slightly better generated code (221 bytes
saved for allmodconfig-GCOV_KERNEL, ~2k with GCOV_KERNEL). As a small
bonus, this also ensures that the nbits parameter is expanded exactly
once.
In BITMAP_FIRST_WORD_MASK, if start is signed gcc is technically allowed
to assume it is positive (or divisible by BITS_PER_LONG), and hence just
do the simple mask. It doesn't seem to use this, and even on an
architecture like x86 where the shift only depends on the lower 5 or 6
bits, and these bits are not affected by the signedness of the expression,
gcc still generates code to compute the C99 mandated value of start %
BITS_PER_LONG. So just use a mask explicitly, also for consistency with
BITMAP_LAST_WORD_MASK.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Tejun Heo <tj@kernel.org>
Reviewed-by: George Spelvin <linux@horizon.com>
Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current semantics of string_escape_mem are inadequate for one of its
current users, vsnprintf(). If that is to honour its contract, it must
know how much space would be needed for the entire escaped buffer, and
string_escape_mem provides no way of obtaining that (short of allocating a
large enough buffer (~4 times input string) to let it play with, and
that's definitely a big no-no inside vsnprintf).
So change the semantics for string_escape_mem to be more snprintf-like:
Return the size of the output that would be generated if the destination
buffer was big enough, but of course still only write to the part of dst
it is allowed to, and (contrary to snprintf) don't do '\0'-termination.
It is then up to the caller to detect whether output was truncated and to
append a '\0' if desired. Also, we must output partial escape sequences,
otherwise a call such as snprintf(buf, 3, "%1pE", "\123") would cause
printf to write a \0 to buf[2] but leaving buf[0] and buf[1] with whatever
they previously contained.
This also fixes a bug in the escaped_string() helper function, which used
to unconditionally pass a length of "end-buf" to string_escape_mem();
since the latter doesn't check osz for being insanely large, it would
happily write to dst. For example, kasprintf(GFP_KERNEL, "something and
then %pE", ...); is an easy way to trigger an oops.
In test-string_helpers.c, the -ENOMEM test is replaced with testing for
getting the expected return value even if the buffer is too small. We
also ensure that nothing is written (by relying on a NULL pointer deref)
if the output size is 0 by passing NULL - this has to work for
kasprintf("%pE") to work.
In net/sunrpc/cache.c, I think qword_add still has the same semantics.
Someone should definitely double-check this.
In fs/proc/array.c, I made the minimum possible change, but longer-term it
should stop poking around in seq_file internals.
[andriy.shevchenko@linux.intel.com: simplify qword_add]
[andriy.shevchenko@linux.intel.com: add missed curly braces]
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KERN_CONT is nicely commented in kern_levels.h, but pr_cont() is now used
more often, and it lacks the comment stating what it is used for. It can
be confused as continuing the log level, but that is not its purpose. Its
purpose is to continue a line that had no newline enclosed. This should
be documented by pr_cont() as well.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel has orderly_poweroff which allows the kernel to initiate a
graceful shutdown of userspace, by running /sbin/poweroff. This adds
orderly_reboot that will cause userspace to shut itself down by calling
/sbin/reboot.
This will be used for shutdown initiated by a system controller on
platforms that do not use ACPI.
orderly_reboot() should be used when the system wants to allow userspace
to gracefully shut itself down. For cases where the system may imminently
catch on fire, the existing emergency_restart() provides an immediate
reboot without involving userspace.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jeremy Kerr <jk@ozlabs.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All users of __check_region(), check_region(), and check_mem_region() are
gone. We got rid of the last user in v4.0-rc1. Remove them.
bloat-o-meter on x86_64 shows:
add/remove: 0/3 grow/shrink: 0/0 up/down: 0/-102 (-102)
function old new delta
__kstrtab___check_region 15 - -15
__ksymtab___check_region 16 - -16
__check_region 71 - -71
Signed-off-by: Jakub Sitnicki <jsitnicki@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are a lot of embedded systems that run most or all of their
functionality in init, running as root:root. For these systems,
supporting multiple users is not necessary.
This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for
non-root users, non-root groups, and capabilities optional. It is enabled
under CONFIG_EXPERT menu.
When this symbol is not defined, UID and GID are zero in any possible case
and processes always have all capabilities.
The following syscalls are compiled out: setuid, setregid, setgid,
setreuid, setresuid, getresuid, setresgid, getresgid, setgroups,
getgroups, setfsuid, setfsgid, capget, capset.
Also, groups.c is compiled out completely.
In kernel/capability.c, capable function was moved in order to avoid
adding two ifdef blocks.
This change saves about 25 KB on a defconfig build. The most minimal
kernels have total text sizes in the high hundreds of kB rather than
low MB. (The 25k goes down a bit with allnoconfig, but not that much.
The kernel was booted in Qemu. All the common functionalities work.
Adding users/groups is not possible, failing with -ENOSYS.
Bloat-o-meter output:
add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650)
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Iulia Manda <iulia.manda21@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 607ca46e97 ("UAPI: (Scripted) Disintegrate include/linux") left
behind some empty conditional blocks. Since they are useless and may
cause a reader to wonder whether something is missing, remove them.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch provides core functions for migration of zsmalloc. Migraion
policy is simple as follows.
for each size class {
while {
src_page = get zs_page from ZS_ALMOST_EMPTY
if (!src_page)
break;
dst_page = get zs_page from ZS_ALMOST_FULL
if (!dst_page)
dst_page = get zs_page from ZS_ALMOST_EMPTY
if (!dst_page)
break;
migrate(from src_page, to dst_page);
}
}
For migration, we need to identify which objects in zspage are allocated
to migrate them out. We could know it by iterating of freed objects in a
zspage because first_page of zspage keeps free objects singly-linked list
but it's not efficient. Instead, this patch adds a tag(ie,
OBJ_ALLOCATED_TAG) in header of each object(ie, handle) so we could check
whether the object is allocated easily.
This patch adds another status bit in handle to synchronize between user
access through zs_map_object and migration. During migration, we cannot
move objects user are using due to data coherency between old object and
new object.
[akpm@linux-foundation.org: zsmalloc.c needs sched.h for cond_resched()]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Gunho Lee <gunho.lee@lge.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
From: Yigal Korman <yigal@plexistor.com>
[v1]
Without this patch, c/mtime is not updated correctly when mmap'ed page is
first read from and then written to.
A new xfstest is submitted for testing this (generic/080)
[v2]
Jan Kara has pointed out that if we add the
sb_start/end_pagefault pair in the new pfn_mkwrite we
are then fixing another bug where: A user could start
writing to the page while filesystem is frozen.
Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs) to
get notified when access is a write to a read-only PFN.
This can happen if we mmap() a file then first mmap-read from it to
page-in a read-only PFN, than we mmap-write to the same page.
We need this functionality to fix a DAX bug, where in the scenario above
we fail to set ctime/mtime though we modified the file. An xfstest is
attached to this patchset that shows the failure and the fix. (A DAX
patch will follow)
This functionality is extra important for us, because upon dirtying of a
pmem page we also want to RDMA the page to a remote cluster node.
We define a new pfn_mkwrite and do not reuse page_mkwrite because
1 - The name ;-)
2 - But mainly because it would take a very long and tedious
audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
users. To make sure they do not now CRASH. For example current
DAX code (which this is for) would crash.
If we would want to reuse page_mkwrite, We will need to first
patch all users, so to not-crash-on-no-page. Then enable this
patch. But even if I did that I would not sleep so well at night.
Adding a new vector is the safest thing to do, and is not that
expensive. an extra pointer at a static function vector per driver.
Also the new vector is better for performance, because else we
Will call all current Kernel vectors, so to:
check-ha-no-page-do-nothing and return.
No need to call it from do_shared_fault because do_wp_page is called to
change pte permissions anyway.
Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mempools keep allocated objects in reserved for situations when ordinary
allocation may not be possible to satisfy. These objects shouldn't be
accessed before they leave the pool.
This patch poison elements when get into the pool and unpoison when they
leave it. This will let KASan to detect use-after-free of mempool's
elements.
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Tested-by: David Rientjes <rientjes@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Chernenkov <drcheren@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Most-used page->mapping helper -- page_mapping() -- has already uninlined.
Let's uninline also page_rmapping() and page_anon_vma(). It saves us
depending on configuration around 400 bytes in text:
text data bss dec hex filename
660318 99254 410000 1169572 11d8a4 mm/built-in.o-before
659854 99254 410000 1169108 11d6d4 mm/built-in.o
I also tried to make code a bit more clean.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add trace events for cma_alloc() and cma_release().
The cma_alloc tracepoint is used both for successful and failed allocations,
in case of allocation failure pfn=-1UL is stored and printed.
Signed-off-by: Stefan Strogin <stefan.strogin@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mpn@google.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
Cc: Thierry Reding <treding@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Flip the flag test so that it is the simplest. No functional change, just
a small readability improvement:
No code changed:
# arch/x86/kernel/sys_x86_64.o:
text data bss dec hex filename
1551 24 0 1575 627 sys_x86_64.o.before
1551 24 0 1575 627 sys_x86_64.o.after
md5:
70708d1b1ad35cc891118a69dc1a63f9 sys_x86_64.o.before.asm
70708d1b1ad35cc891118a69dc1a63f9 sys_x86_64.o.after.asm
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now we have an easy access to hugepages' activeness, so existing helpers to
get the information can be cleaned up.
[akpm@linux-foundation.org: s/PageHugeActive/page_huge_active/]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hugh Dickins <hughd@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All occurrences of mempools based on slab caches with object constructors
have been removed from the tree, so disallow creating them.
We can only dereference mem->ctor in mm/mempool.c without including
mm/slab.h in include/linux/mempool.h. So simply note the restriction,
just like the comment restricting usage of __GFP_ZERO, and warn on kernels
with CONFIG_DEBUG_VM() if such a mempool is allocated from.
We don't want to incur this check on every element allocation, so use
VM_BUG_ON().
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make 'min_size=<value>' be an option when mounting a hugetlbfs. This
option takes the same value as the 'size' option. min_size can be
specified without specifying size. If both are specified, min_size must
be less that or equal to size else the mount will fail. If min_size is
specified, then at mount time an attempt is made to reserve min_size
pages. If the reservation fails, the mount fails. At umount time, the
reserved pages are released.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hugetlbfs allocates huge pages from the global pool as needed. Even if
the global pool contains a sufficient number pages for the filesystem size
at mount time, those global pages could be grabbed for some other use. As
a result, filesystem huge page allocations may fail due to lack of pages.
Applications such as a database want to use huge pages for performance
reasons. hugetlbfs filesystem semantics with ownership and modes work
well to manage access to a pool of huge pages. However, the application
would like some reasonable assurance that allocations will not fail due to
a lack of huge pages. At application startup time, the application would
like to configure itself to use a specific number of huge pages. Before
starting, the application can check to make sure that enough huge pages
exist in the system global pools. However, there are no guarantees that
those pages will be available when needed by the application. What the
application wants is exclusive use of a subset of huge pages.
Add a new hugetlbfs mount option 'min_size=<value>' to indicate that the
specified number of pages will be available for use by the filesystem. At
mount time, this number of huge pages will be reserved for exclusive use
of the filesystem. If there is not a sufficient number of free pages, the
mount will fail. As pages are allocated to and freeed from the
filesystem, the number of reserved pages is adjusted so that the specified
minimum is maintained.
This patch (of 4):
Add a field to the subpool structure to indicate the minimimum number of
huge pages to always be used by this subpool. This minimum count includes
allocated pages as well as reserved pages. If the minimum number of pages
for the subpool have not been allocated, pages are reserved up to this
minimum. An additional field (rsv_hpages) is used to track the number of
pages reserved to meet this minimum size. The hstate pointer in the
subpool is convenient to have when reserving and unreserving the pages.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"deactivate_page" was created for file invalidation so it has too
specific logic for file-backed pages. So, let's change the name of the
function and date to a file-specific one and yield the generic name.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Wang, Yalin <Yalin.Wang@sonymobile.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, pages which are marked as unevictable are protected from
compaction, but not from other types of migration. The POSIX real time
extension explicitly states that mlock() will prevent a major page
fault, but the spirit of this is that mlock() should give a process the
ability to control sources of latency, including minor page faults.
However, the mlock manpage only explicitly says that a locked page will
not be written to swap and this can cause some confusion. The
compaction code today does not give a developer who wants to avoid swap
but wants to have large contiguous areas available any method to achieve
this state. This patch introduces a sysctl for controlling compaction
behavior with respect to the unevictable lru. Users who demand no page
faults after a page is present can set compact_unevictable_allowed to 0
and users who need the large contiguous areas can enable compaction on
locked memory by leaving the default value of 1.
To illustrate this problem I wrote a quick test program that mmaps a
large number of 1MB files filled with random data. These maps are
created locked and read only. Then every other mmap is unmapped and I
attempt to allocate huge pages to the static huge page pool. When the
compact_unevictable_allowed sysctl is 0, I cannot allocate hugepages
after fragmenting memory. When the value is set to 1, allocations
succeed.
Signed-off-by: Eric B Munson <emunson@akamai.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
THP uses tail page refcounting to be able to split huge pages at any time.
Tail page refcounting is not needed for other users of compound pages and
it's harmful because of overhead.
We try to exclude non-THP pages from tail page refcounting using
__compound_tail_refcounted() check. It excludes most common non-THP
compound pages: SL*B and hugetlb, but it doesn't catch rest of __GFP_COMP
users -- drivers.
And it's not only about overhead.
Drivers might want to use compound pages to get refcounting semantics
suitable for mapping high-order pages to userspace. But tail page
refcounting breaks it.
Tail page refcounting uses ->_mapcount in tail pages to store GUP pins on
them. It means GUP pins would affect page_mapcount() for tail pages.
It's not a problem for THP, because it never maps tail pages. But unlike
THP, drivers map parts of compound pages with PTEs and it makes
page_mapcount() be called for tail pages.
In particular, GUP pins would shift PSS up and affect /proc/kpagecount for
such pages. But, I'm not aware about anything which can lead to crash or
other serious misbehaviour.
Since currently all THP pages are anonymous and all drivers pages are not,
we can fix the __compound_tail_refcounted() check by requiring PageAnon()
to enable tail page refcounting.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we take a naive approach to page flags on compound pages - we
set the flag on the page without consideration if the flag makes sense
for tail page or for compound page in general. This patchset try to
sort this out by defining per-flag policy on what need to be done if
page-flag helper operate on compound page.
The last patch in the patchset also sanitizes usege of page->mapping for
tail pages. We don't define the meaning of page->mapping for tail
pages. Currently it's always NULL, which can be inconsistent with head
page and potentially lead to problems.
For now I caught one case of illegal usage of page flags or ->mapping:
sound subsystem allocates pages with __GFP_COMP and maps them with PTEs.
It leads to setting dirty bit on tail pages and access to tail_page's
->mapping. I don't see any bad behaviour caused by this, but worth
fixing anyway.
This patchset makes more sense if you take my THP refcounting into
account: we will see more compound pages mapped with PTEs and we need to
define behaviour of flags on compound pages to avoid bugs.
This patch (of 16):
We have page-flags helper function declarations/definitions spread over
several header files. Let's consolidate them in <linux/page-flags.h>.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>