Simply replace the whole kfifo.c and kfifo.h files with the new generic
version and fix the kerneldoc API template file.
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add the new version of the kfifo API files kfifo.c and kfifo.h.
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are different types of fifo which cannot be handled in C without a
lot of overhead. So I decided to write the API as a set of macros, which
is the only way to do a kind of template metaprogramming without C++.
These macros handle the different types of fifos in a transparent way.
There are a lot of benefits:
- Compile time handling of the different fifo types
- Better performance (a safe put or get of an integer generates only
9 assembly instructions on x86)
- Type safe
- Cleaner interface, the additional kfifo_..._rec() functions are gone
- Easier to use
- Less error prone
- Different types of fifos: it is now possible to define an int fifo or
any other type. See below for an example.
- Smaller footprint for non-byte-type fifos
- No need to create a second hidden variable, like in the old DEFINE_KFIFO
The API was not changed.
There are now real in-place fifos where the data space is part of the
structure. The fifo now needs 20 bytes plus the fifo space. Dynamically
assigned or allocated fifos create a little more code.
Most of the macro code will be optimized away and simply generates a
function call. Only the really small ones generate inline code.
Additionally you can now create fifos for any data type, not only the
"unsigned char" byte streamed fifos.
There are also new kfifo_put and kfifo_get functions to handle a single
element in a fifo. These macros generate inline code, which is a little bit
larger but faster.
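As a rough userspace illustration of the "macros as templates" idea, the sketch below declares a fifo for an arbitrary element type and provides single-element put and get macros. It is not the kfifo implementation; DECLARE_TYPED_FIFO, typed_fifo_put, typed_fifo_get and typed_fifo_len are hypothetical names, and the size is assumed to be a power of two.

```c
/*
 * Hypothetical userspace sketch, not the kernel kfifo code.
 * The element count must be a power of two so that the index mask works.
 */
#define DECLARE_TYPED_FIFO(name, type, size)                    \
	struct name {                                           \
		unsigned int in;   /* total elements written */ \
		unsigned int out;  /* total elements read */    \
		type buf[size];    /* in-place data space */    \
	}

#define TYPED_FIFO_SIZE(f)  (sizeof((f)->buf) / sizeof((f)->buf[0]))
#define typed_fifo_len(f)   ((f)->in - (f)->out)

/* Store one element; evaluates to 1 on success, 0 if the fifo is full. */
#define typed_fifo_put(f, val)                                          \
	(typed_fifo_len(f) < TYPED_FIFO_SIZE(f) ?                       \
	 ((f)->buf[(f)->in++ & (TYPED_FIFO_SIZE(f) - 1)] = (val), 1) : 0)

/* Fetch one element into *ptr; evaluates to 1 on success, 0 if empty. */
#define typed_fifo_get(f, ptr)                                          \
	(typed_fifo_len(f) > 0 ?                                        \
	 (*(ptr) = (f)->buf[(f)->out++ & (TYPED_FIFO_SIZE(f) - 1)], 1) : 0)

/* Instantiate the "template" for a four-element fifo of int. */
DECLARE_TYPED_FIFO(int_fifo, int, 4);
```

Because the element type is baked into the structure at compile time, the compiler can check the type of every value passed to the put and get macros, which is the kind of type safety the list above refers to.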
I know that this kind of macro is very sophisticated and not easy to
maintain. But I have tested everything and it works as expected. I analyzed the
output of the compiler and for x86 the code is as good as hand-written
assembler code. For the byte stream fifo the generated code is exactly the
same as with the current kfifo implementation. For all other types of
fifos the code is smaller than before, because the interface is easier to use.
The main goal was to provide an API which is very intuitive, safe and easy
to use. So Linux will now get a powerful fifo API which provides everything
a developer needs. This will save a lot of kernel space in the future,
since there is no need to write one's own implementation. Most device
driver developers need a fifo, and deep kernel development will also
benefit from this API.
Here are the results of the text section usage:
Example 1:
                        kfifo_put/_get  kfifo_in/out    current kfifo
dynamic allocated       0x000002a8      0x00000291      0x00000299
in place                0x00000291      0x0000026e      0x00000273
kfifo.c                 new             old
text section size       0x00000be5      0x000008b2
As you can see, kfifo_put/kfifo_get creates a little bit more code than
kfifo_in/kfifo_out, but it is much faster (the code is inline).
The code is completely hand-crafted and optimized. The text section size is
as small as possible. You get all the fifo handling in only 3 kb. This
includes type-safe fixed-size records, dynamic records and DMA handling.
This should be the final version. All requested features are implemented.
Note: Most features of this API don't have any users. All functions
which are not used in the next 9 months will be removed. So, please adapt
your drivers and other sources as soon as possible to the new API and post
them.
These are the features which are currently not used in the kernel:
kfifo_to_user()
kfifo_from_user()
kfifo_dma_....() macros
kfifo_esize()
kfifo_recsize()
kfifo_put()
kfifo_get()
The fixed-size record elements (excluding "unsigned char" fifos) and the
variable-size record fifos
This patch:
Users of the kernel fifo should never bypass the API and directly access
the fifo structure. Otherwise it will be very hard to maintain the API.
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For consistency with other kfifo routines, return bool, not int.
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Cc: Stefani Seibold <stefani@seibold.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds byte order autodetection (of PDP-11 and LE filesystems). No
attempt is made to detect big-endian filesystems -- were there any?
Tested with PDP-11 v7 filesystems and PC-IX maintenance floppy.
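For readers unfamiliar with the PDP-11 layout, the sketch below contrasts how the same four on-disk bytes decode as a little-endian 32-bit value and as a PDP-11 ("middle-endian") one. It only illustrates the two byte orders; how the patch actually detects which one a filesystem uses is not described in this message, and the function names are hypothetical.

```c
#include <stdint.h>

/* Decode a 32-bit value stored in little-endian byte order. */
static uint32_t le32_decode(const unsigned char b[4])
{
	return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/*
 * Decode a 32-bit value stored in PDP-11 order: two little-endian
 * 16-bit words, with the most significant word stored first.
 */
static uint32_t pdp32_decode(const unsigned char b[4])
{
	uint32_t hi = (uint32_t)b[0] | ((uint32_t)b[1] << 8);
	uint32_t lo = (uint32_t)b[2] | ((uint32_t)b[3] << 8);

	return (hi << 16) | lo;
}
```

Decoding a field both ways and seeing which result is plausible is one conceivable basis for autodetection, but that is an assumption here, not a statement about the patch.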
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Newly mkfs-ed filesystems from Seventh Edition have last modification time
set to zero, but are otherwise perfectly valid.
Also, tighten up other sanity checks to filter out most filesystems with
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
So that the module gets autoloaded when a v7 filesystem is mounted.
Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
copy_to/from_user() returns the number of bytes remaining to be copied.
It never returns a negative value. The correct return code is -EFAULT and
not -EIO.
All the callers check for non-zero returns so that's Ok, but the return
code is passed to the user so we should fix this.
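A minimal userspace sketch of the convention described above: the copy helper reports how many bytes were left uncopied, and the caller turns any non-zero remainder into -EFAULT. model_copy() and model_read() are hypothetical stand-ins, not kernel functions.

```c
#include <errno.h>
#include <string.h>

/*
 * Hypothetical stand-in for copy_to_user(): copies at most `limit` bytes
 * and returns the number of bytes NOT copied. It never returns a
 * negative value.
 */
static unsigned long model_copy(void *dst, const void *src,
				unsigned long n, unsigned long limit)
{
	unsigned long done = n < limit ? n : limit;

	memcpy(dst, src, done);
	return n - done;
}

/* Caller pattern: any non-zero remainder is reported as -EFAULT. */
static int model_read(void *dst, const void *src,
		      unsigned long n, unsigned long limit)
{
	if (model_copy(dst, src, n, limit))
		return -EFAULT;
	return 0;
}
```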
Signed-off-by: Dan Carpenter <error27@gmail.com>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Simon Kagstrom <simon.kagstrom@netinsight.net>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 51dcdfe ("parport: Use the PCI IRQ if offered") added IRQ support
for PCI parallel port devices handled by parport_pc, but turned it off for
parport_serial, despite a printk() message to the contrary.
Signed-off-by: Frédéric Brière <fbriere@fbriere.net>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We are missing the oops end marker for the exception based WARN implementation
in lib/bug.c. This is useful for logfile analysis tools.
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are a few issues with the exception based WARN implementation in
lib/bug.c:
- Inconsistent printk flags. The "cut here" line is printed at KERN_EMERG, so
the console and all logged in users see the single line:
------------[ cut here ]------------
for each WARN. Fix this so we print everything at KERN_WARNING to match the
kernel/panic.c version.
- The lib/bug.c WARN would print "Badness at". Change it to match the
kernel/panic.c version which prints "WARNING: at".
- Print the list of modules, similar to kernel/panic.c
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To keep panic_timeout accurate when running under a hypervisor, the
current implementation spins only on long (1 second) calls to mdelay.
That works well, but the problem is that the keyboard LEDs don't
blink at all in that situation.
This patch calls panic_blink_enter() between every mdelay so that the
LEDs keep blinking even in long spin timer mode.
The mdelay interval is now 100ms. Even with this change, panic_timeout
stays accurate enough when running under a hypervisor.
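The loop shape described above can be modelled in userspace as follows. This is only a sketch under assumptions: model_blink() and model_mdelay() are hypothetical stand-ins that count calls instead of touching hardware, and the real loop may do additional work per step.

```c
#define MODEL_STEP_MS 100	/* delay per iteration, as stated above */

static int model_blink_calls;	/* how often the blink hook ran */
static long model_delayed_ms;	/* total time "spent" in delays */

static void model_blink(void)
{
	model_blink_calls++;
}

static void model_mdelay(long ms)
{
	model_delayed_ms += ms;
}

/* Wait timeout_s seconds in 100ms steps, blinking before each delay. */
static void model_panic_wait(int timeout_s)
{
	long i;

	for (i = 0; i < timeout_s * 1000L; i += MODEL_STEP_MS) {
		model_blink();
		model_mdelay(MODEL_STEP_MS);
	}
}
```

With a two second timeout the model calls the blink hook 20 times while still delaying for 2000ms in total, so the blink hook gets regular chances to run without the delays becoming very short.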
Signed-off-by: TAMUKI Shoichi <tamuki@linet.gr.jp>
Cc: Ben Dooks <ben-linux@fluff.org>
Cc: Russell King <linux@arm.linux.org.uk>
Acked-by: Dmitry Torokhov <dtor@mail.ru>
Cc: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can clean up the work queue on this error path. This function is
called from afs_init().
Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
DMA_xxBIT_MASK macros were marked as deprecated in June 2009. One more
year is long enough, I think.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It was replaced with the DMA unmap state API (which can be used for
any bus).
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Architectures implement dma_is_consistent() in different ways (some
misinterpret the definition of the API in DMA-API.txt), so it hasn't been very
useful for drivers. We have only one user of the API in tree. It is unlikely
that out-of-tree drivers use the API.
Even if we fix dma_is_consistent() in some architectures, it doesn't look
useful at all. It was invented long ago for some old systems that can't
allocate coherent memory at all. It's better to export only APIs that are
definitely necessary for drivers.
Let's remove this API.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This driver is the only user of dma_is_consistent(). We plan to remove this
API.
The driver uses the API in the following way:
BUG_ON(!dma_is_consistent(hostdata->dev, pScript) && L1_CACHE_BYTES < dma_get_cache_alignment());
The above code checks that L1_CACHE_BYTES is at least
dma_get_cache_alignment() on systems that cannot allocate coherent memory
(some old systems can't).
James Bottomley explained that this is necessary because the driver packs the
set of mailboxes into a single coherent area and separates the different
usages by a L1 cache stride. So it's fatal if the dma
He also pointed out that we can kill this checking because we don't hit this
BUG_ON on all architectures that actually use the driver.
(akpm: stolen from the scsi tree because
dma-mapping-remove-dma_is_consistent-api.patch needs it)
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Architectures that handle DMA-non-coherent memory need to set
ARCH_DMA_MINALIGN to make sure that kmalloc'ed buffer is DMA-safe: the
buffer doesn't share a cache with the others.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Helge Deller <deller@gmx.de>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dma_get_cache_alignment returns the minimum DMA alignment. Architectures
define it as ARCH_DMA_MINALIGN (formerly ARCH_KMALLOC_MINALIGN). So we
can unify dma_get_cache_alignment implementations.
Note that some architectures implement dma_get_cache_alignment wrongly.
dma_get_cache_alignment() should return the minimum DMA alignment. So
fully-coherent architectures should return 1. This patch also fixes this
issue.
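A unified implementation along these lines could look roughly like the sketch below: return ARCH_DMA_MINALIGN when the architecture defines it, and 1 otherwise. The function name is a userspace stand-in, and the commented-out define is only there to show how an architecture with a restriction would opt in.

```c
/* Uncomment to model an architecture with a DMA alignment restriction. */
/* #define ARCH_DMA_MINALIGN 64 */

static int model_dma_get_cache_alignment(void)
{
#ifdef ARCH_DMA_MINALIGN
	return ARCH_DMA_MINALIGN;	/* architecture-specific minimum */
#else
	return 1;			/* fully coherent: no restriction */
#endif
}
```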
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now each architecture has its own dma_get_cache_alignment implementation.
dma_get_cache_alignment returns the minimum DMA alignment. Architectures
define it as ARCH_KMALLOC_MINALIGN (it's used to make sure that malloc'ed
buffer is DMA-safe; the buffer doesn't share a cache with the others). So
we can unify dma_get_cache_alignment implementations.
This patch:
dma_get_cache_alignment() needs to know if an architecture defines
ARCH_KMALLOC_MINALIGN or not (needs to know if the architecture has a DMA
alignment restriction). However, slab.h defines ARCH_KMALLOC_MINALIGN if
an architecture doesn't define it.
Let's rename ARCH_KMALLOC_MINALIGN to ARCH_DMA_MINALIGN.
ARCH_KMALLOC_MINALIGN is used only in the internals of slab/slob/slub
(except for crypto).
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-EIO is not the only error code that pci_enable_device() may return, and
the set of errors may grow in the future. We should compare the return
code with zero, not with a specific error value.
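The difference between the two checks can be sketched as follows. model_enable_device() is a hypothetical stand-in for pci_enable_device() that simply returns the result it is given; what the real drivers return on failure is not stated here, so the -ENODEV below is an assumption.

```c
#include <errno.h>

/*
 * Hypothetical stand-in for pci_enable_device(): 0 on success,
 * a negative errno value on failure.
 */
static int model_enable_device(int simulated_result)
{
	return simulated_result;
}

/* Fragile: only -EIO is treated as a failure, other errors slip through. */
static int probe_fragile(int simulated_result)
{
	if (model_enable_device(simulated_result) == -EIO)
		return -ENODEV;
	return 0;
}

/* Robust: any non-zero return is treated as a failure. */
static int probe_robust(int simulated_result)
{
	int err = model_enable_device(simulated_result);

	if (err)
		return err;
	return 0;
}
```

With a simulated -ENOMEM, the fragile variant reports success while the robust one reports the failure.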
Signed-off-by: Kulikov Vasiliy <segooon@gmail.com>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Cc: Jeff Roberson <jroberson@jroberson.net>
Cc: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-EIO is not the only error code that pci_enable_device() may return, and
the set of errors may grow in the future. We should compare the return
code with zero, not with a specific error value.
Signed-off-by: Kulikov Vasiliy <segooon@gmail.com>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Cc: Jeff Roberson <jroberson@jroberson.net>
Cc: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In 5753c082f6 ("powerpc/85xx: Kconfig cleanup") menuconfig MPC85xx was
replaced by FSL_SOC_BOOKE but some references inside the code were not
adjusted accordingly. This patch addresses these missing pieces.
Signed-off-by: Christoph Egger <siccegge@cs.fau.de>
Cc: Doug Thompson <dougthompson@xmission.com>
Cc: Peter Tyser <ptyser@xes-inc.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Scott Wood <scottwood@freescale.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
alloc_pidmap() calculates max_scan so that if the initial offset != 0 we
inspect the first map->page twice. This is correct, we want to find the
unused bits < offset in this bitmap block. Add the comment.
But it doesn't make any sense to stop the find_next_offset() loop when we
are looking into this map->page for the second time. We have
already checked the bits >= offset during the first attempt, but it is fine to
do this again, no matter if we succeed this time or not.
Remove this hard-to-understand code. It optimizes the very unlikely case
when we are going to fail, but slows down the more likely case.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Salman Qazi <sqazi@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A program that repeatedly forks and waits is susceptible to having the
same pid repeated, especially when it competes with another instance of
the same program. This is really bad for bash implementation.
Furthermore, many shell scripts assume that pid numbers will not be used
for some length of time.
Race Description:
A                                       B
// pid == offset == n                   // pid == offset == n + 1
test_and_set_bit(offset, map->page)
                                        test_and_set_bit(offset, map->page);
                                        pid_ns->last_pid = pid;
pid_ns->last_pid = pid;
                                        // pid == n + 1 is freed (wait())
                                        // Next fork()...
                                        last = pid_ns->last_pid; // == n
                                        pid = last + 1;
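The interleaving can be replayed deterministically in a single-threaded model. The sketch below is only an illustration: the structure and function names are hypothetical, the "forward only" update is an assumed remedy chosen for clarity (this message does not spell out the actual fix), and pid wraparound is ignored.

```c
/* Hypothetical model of the namespace state touched by the race. */
struct model_ns {
	int last_pid;
};

/* Unconditional store, as in the race description. */
static void set_last_pid_plain(struct model_ns *ns, int pid)
{
	ns->last_pid = pid;
}

/* Assumed remedy: never move last_pid backwards (wraparound ignored). */
static void set_last_pid_forward(struct model_ns *ns, int pid)
{
	if (pid > ns->last_pid)
		ns->last_pid = pid;
}

/*
 * Replay the schedule: A claims pid n, B claims pid n + 1, B publishes
 * last_pid first and A publishes second. Returns the pid that the next
 * fork would try after pid n + 1 has been freed.
 */
static int replay(void (*set_last)(struct model_ns *, int), int n)
{
	struct model_ns ns = { .last_pid = n - 1 };
	int pid_a = n;
	int pid_b = n + 1;

	set_last(&ns, pid_b);	/* B: pid_ns->last_pid = n + 1 */
	set_last(&ns, pid_a);	/* A: pid_ns->last_pid = n (stale) */
	return ns.last_pid + 1;	/* next fork: pid = last + 1 */
}
```

With the plain store the next fork is handed n + 1 again, the pid that was just freed; with the forward-only store it moves on to n + 2.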
Code to reproduce it (Running multiple instances is more effective):
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
// The distance mod 32768 between two pids, where the first pid is expected
// to be smaller than the second.
int PidDistance(pid_t first, pid_t second) {
	return (second + 32768 - first) % 32768;
}
int main(int argc, char* argv[]) {
	int failed = 0;
	pid_t last_pid = 0;
	int i;
	printf("%zu\n", sizeof(pid_t));
	for (i = 0; i < 10000000; ++i) {
		// Report progress once per pass through the pid space.
		if (i % 32768 == 0)
			printf("Iter: %d\n", i/32768);
		int child_exit_code = i % 256;
		pid_t pid = fork();
		if (pid == -1) {
			fprintf(stderr, "fork failed, iteration %d, errno=%d\n", i, errno);
			exit(1);
		}
		if (pid == 0) {
			// Child
			exit(child_exit_code);
		} else {
			// Parent
			if (i > 0) {
				// A repeated pid shows up as distance 0 or a near-full wrap.
				int distance = PidDistance(last_pid, pid);
				if (distance == 0 || distance > 30000) {
					fprintf(stderr,
						"Unexpected pid sequence: previous fork: pid=%d, "
						"current fork: pid=%d for iteration=%d.\n",
						last_pid, pid, i);
					failed = 1;
				}
			}
			last_pid = pid;
			int status;
			int reaped = wait(&status);
			if (reaped != pid) {
				fprintf(stderr,
					"Wait return value: expected pid=%d, "
					"got %d, iteration %d\n",
					pid, reaped, i);
				failed = 1;
			} else if (WEXITSTATUS(status) != child_exit_code) {
				fprintf(stderr,
					"Unexpected exit status %x, iteration %d\n",
					WEXITSTATUS(status), i);
				failed = 1;
			}
		}
	}
	exit(failed);
}
Thanks to Ted Tso for the key ideas of this implementation.
Signed-off-by: Salman Qazi <sqazi@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The MFGPT hardware may be set up only once, therefore
cs5535_mfgpt_free_timer() didn't re-set the timer's "avail" bit. However
if a timer is freed before it has actually been in use then it may be made
available again.
Signed-off-by: Jens Rottmann <JRottmann@LiPPERTEmbedded.de>
Acked-by: Andres Salomon <dilinger@queued.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jordan Crouse <jordan@cosmicpenguin.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a spin_unlock_irqrestore missing on the error path. Converting the
return to break leads to the spin_unlock_irqrestore at the end of the
function.
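The shape of the bug and of the fix can be sketched in userspace with a toy lock that merely counts how often it is held. All names here are hypothetical; the real code uses spin_lock_irqsave() and spin_unlock_irqrestore().

```c
/* Toy lock: `held` counts unbalanced lock operations. */
struct model_lock {
	int held;
};

static void model_lock_irqsave(struct model_lock *l)
{
	l->held++;
}

static void model_unlock_irqrestore(struct model_lock *l)
{
	l->held--;
}

/* Buggy shape: the early return skips the unlock at the end. */
static int scan_buggy(struct model_lock *l, const int *items, int n)
{
	int i;

	model_lock_irqsave(l);
	for (i = 0; i < n; i++) {
		if (items[i] < 0)
			return -1;	/* lock is still held here */
	}
	model_unlock_irqrestore(l);
	return 0;
}

/* Fixed shape: break out of the loop so the common unlock runs. */
static int scan_fixed(struct model_lock *l, const int *items, int n)
{
	int i, ret = 0;

	model_lock_irqsave(l);
	for (i = 0; i < n; i++) {
		if (items[i] < 0) {
			ret = -1;
			break;
		}
	}
	model_unlock_irqrestore(l);
	return ret;
}
```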
The semantic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression E1;
@@
* spin_lock_irqsave(E1,...);
<+... when != E1
if (...) {
... when != E1
* return ...;
}
...+>
* spin_unlock_irqrestore(E1,...);
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Alan Cox <alan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Print out the reg spacing and size for spmi and smbios so BIOS developers
can make them consistent.
Also remove extra PFX on the duplicating path.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Corey Minyard <minyard@acm.org>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Myron Stowe <myron.stowe@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Free the temporary info struct when we have duplicated ones.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Corey Minyard <minyard@acm.org>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Myron Stowe <myron.stowe@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a warning message generated by GCC, and also update a web address
pointing to a PDF containing information.
CC [M] drivers/char/ipmi/ipmi_si_intf.o
drivers/char/ipmi/ipmi_si_intf.c: In function 'try_init_spmi':
drivers/char/ipmi/ipmi_si_intf.c:2016:8: warning: variable 'addr_space' set but not used
Signed-off-by: Sergey V. <sftp.mtuci@gmail.com>
Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Acked-by: Corey Minyard <minyard@acm.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since the entire fs/proc directory is conditionally included based on
CONFIG_PROC_FS, it's redundant to check that same variable within that
directory.
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If signalfd is used to consume a signal generated by a POSIX interval
timer or POSIX message queue, the ssi_int field does not reflect the data
(sigevent->sigev_value) supplied to timer_create(2) or mq_notify(3). (The
ssi_ptr field, however, is filled in.)
This behavior differs from signalfd's treatment of sigqueue-generated
signals -- see the default case in signalfd_copyinfo. It also gives
results that differ from the case when a signal is handled conventionally
via a sigaction-registered handler.
So, set signalfd_siginfo->ssi_int in the remaining cases (__SI_TIMER,
__SI_MESGQ) where ssi_ptr is set.
akpm: a non-back-compatible change. Merge into -stable to minimise the
number of kernels which are in the field and which miss this feature.
Signed-off-by: Nathan Lynch <ntl@pobox.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
exit_ptrace() takes tasklist_lock unconditionally. We need this lock to
avoid the race with ptrace_traceme(), it acts as a barrier.
Change its caller, forget_original_parent(), to call exit_ptrace() under
tasklist_lock. Change exit_ptrace() to drop and reacquire this lock if
needed.
This allows us to add the fastpath list_empty(ptraced) check. In the
likely no-tracees case exit_ptrace() just returns and we avoid the lock()
+ unlock() sequence.
"Zhang, Yanmin" <yanmin_zhang@linux.intel.com> suggested to add this
check, and he reports that this change adds about 11% improvement in some
tests.
Suggested-and-tested-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mem_cgroup_soft_limit_reclaim() has zone, nid and zid arguments, but nid
and zid can be calculated from zone. So remove them.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Nishimura Daisuke <d-nishimura@mtf.biglobe.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently mem_cgroup_shrink_node_zone() calls shrink_zone() directly, so
it doesn't need to initialize sc.nodemask because shrink_zone() doesn't
use it at all.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Nishimura Daisuke <d-nishimura@mtf.biglobe.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, mem_cgroup_shrink_node_zone() initializes sc.nr_to_reclaim to 0.
This means shrink_zone() scans only 32 pages and returns immediately even if
it doesn't reclaim any pages.
This patch fixes it.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Nishimura Daisuke <d-nishimura@mtf.biglobe.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now, memory cgroup increments css (cgroup subsys state)'s reference count
per charged page, and the reference count is kept until the page is
uncharged. But this has several bad effects.
1. Because css_get/put calls atomic_inc()/dec, heavy use of them
on large SMP systems will not scale well.
2. Because css's refcnt cannot be in a "ready-to-release" state,
cgroup's notify_on_release handler can't work with memcg.
3. css's refcnt is atomic_t, which means it is smaller than 32 bits. Maybe too small.
This has been a problem since the first merge of memcg.
This is an attempt to remove css's per-page refcnt. Even if we remove the
refcnt, pre_destroy() does enough synchronization:
- check res->usage == 0.
- check no pages on LRU.
This patch removes css's refcnt per page. Even after this patch, at
first look, it seems css_get() is still called in try_charge().
But the logic is:
- If the memcg of mm->owner is the cached one, consume_stock() will work.
On success, return immediately.
- If consume_stock() returns false, css_get() is called and we go to the
slow path, which may block. At the end of the slow path,
css_put() is called and we restart from the start if necessary.
So, in the fast path, we don't call css_get() and can avoid access to the
shared counter. This patch makes the most likely case fast.
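The fast path and slow path described above can be modelled roughly as follows. This is a loose userspace sketch: the structures, the refill batch of 32 pages and the counter of refcount operations are all assumptions made for illustration, not the memcg code.

```c
#define MODEL_BATCH 32	/* assumed number of pages pre-charged per refill */

struct model_memcg {
	int id;
	int css_refcnt_ops;	/* counts modelled css_get()/css_put() calls */
};

struct model_stock {
	int cached_id;		/* which memcg the cached pages belong to */
	int nr_pages;		/* pre-charged pages left in the stock */
};

/* Fast path helper: take one page from the stock if it matches. */
static int model_consume_stock(struct model_stock *s,
			       const struct model_memcg *m)
{
	if (s->cached_id == m->id && s->nr_pages > 0) {
		s->nr_pages--;
		return 1;
	}
	return 0;
}

/* Charge one page, touching the shared refcount only on the slow path. */
static void model_try_charge(struct model_stock *s, struct model_memcg *m)
{
	if (model_consume_stock(s, m))
		return;			/* fast path: no refcount traffic */

	m->css_refcnt_ops++;		/* modelled css_get() */
	s->cached_id = m->id;		/* refill the stock with a batch */
	s->nr_pages = MODEL_BATCH - 1;	/* one page is used by this charge */
	m->css_refcnt_ops++;		/* modelled css_put() */
}
```

In this model, 64 charges touch the shared counter only four times, because only two of them take the slow path.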
Here is a result of multi-threaded page fault benchmark.
[Before]
25.32% multi-fault-all [kernel.kallsyms] [k] clear_page_c
9.30% multi-fault-all [kernel.kallsyms] [k] _raw_spin_lock_irqsave
8.02% multi-fault-all [kernel.kallsyms] [k] try_get_mem_cgroup_from_mm <=====(*)
7.83% multi-fault-all [kernel.kallsyms] [k] down_read_trylock
5.38% multi-fault-all [kernel.kallsyms] [k] __css_put
5.29% multi-fault-all [kernel.kallsyms] [k] __alloc_pages_nodemask
4.92% multi-fault-all [kernel.kallsyms] [k] _raw_spin_lock_irq
4.24% multi-fault-all [kernel.kallsyms] [k] up_read
3.53% multi-fault-all [kernel.kallsyms] [k] css_put
2.11% multi-fault-all [kernel.kallsyms] [k] handle_mm_fault
1.76% multi-fault-all [kernel.kallsyms] [k] __rmqueue
1.64% multi-fault-all [kernel.kallsyms] [k] __mem_cgroup_commit_charge
[After]
28.41% multi-fault-all [kernel.kallsyms] [k] clear_page_c
10.08% multi-fault-all [kernel.kallsyms] [k] _raw_spin_lock_irq
9.58% multi-fault-all [kernel.kallsyms] [k] down_read_trylock
9.38% multi-fault-all [kernel.kallsyms] [k] _raw_spin_lock_irqsave
5.86% multi-fault-all [kernel.kallsyms] [k] __alloc_pages_nodemask
5.65% multi-fault-all [kernel.kallsyms] [k] up_read
2.82% multi-fault-all [kernel.kallsyms] [k] handle_mm_fault
2.64% multi-fault-all [kernel.kallsyms] [k] mem_cgroup_add_lru_list
2.48% multi-fault-all [kernel.kallsyms] [k] __mem_cgroup_commit_charge
Then, 8.02% of try_get_mem_cgroup_from_mm() disappears because this patch
removes css_tryget() in it. (But yes, this is an extreme case.)
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the OOM killer scans tasks, it checks whether a task is under memcg
when it's called via memcg's context.
But, as Oleg pointed out, a thread group leader may have a NULL ->mm
and task_in_mem_cgroup() may make the wrong decision. We have to use
find_lock_task_mm() in memcg as the generic OOM killer does.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mem_cgroup_charge_common() is always called with @mem = NULL, so it's
meaningless. This patch removes it.
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- try_get_mem_cgroup_from_mm() calls rcu_read_lock/unlock by itself, so we
don't have to call them in task_in_mem_cgroup().
- *mz is not used in __mem_cgroup_uncharge_common().
- we don't have to call lookup_page_cgroup() in mem_cgroup_end_migration()
after we've cleared PCG_MIGRATION of @oldpage.
- remove empty comment.
- remove redundant empty line in mem_cgroup_cache_charge().
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now, to check whether a memcg is under task-account-moving, we do css_tryget()
against mc.to and mc.from. But this just complicates things. This
patch makes the check easier.
This patch adds a spinlock to move_charge_struct to guard modification of
mc.to and mc.from. With this, we don't have to think about complicated
races around this non-critical path.
[balbir@linux.vnet.ibm.com: don't crash on a null memcg being passed]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mem_cgroup_try_charge() has a big loop in it and is hard to read.
Most of its routines are for the slow path. This patch moves code out of the
loop and makes it clear what's done.
Summary:
- refactor a function to detect whether a memcg is under account move.
- refactor a function to wait for the end of moving task acct.
- refactor the main loop's slow path into a function and make it clear
why we retry or quit by return code.
- add fatal_signal_pending() check for bypassing charge loops.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's been 11 months since we changed swap_map[] to indicate SWAP_HAS_CACHE.
Since then, memcg's swap accounting has been very stable and it seems
it can be maintained.
So, I'd like to remove EXPERIMENTAL from the config.
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The cgroup device whitelist code gets confused when trying to grant
permission to a disk partition that is not currently open. Part of
blkdev_open() includes __blkdev_get() on the whole disk.
Basically, the only ways to reliably allow a cgroup access to a partition
on a block device when using the whitelist are to 1) also give it access
to the whole block device or 2) make sure the partition is already open in
a different context.
The patch avoids the cgroup check for the whole disk case when opening a
partition.
Addresses https://bugzilla.redhat.com/show_bug.cgi?id=589662
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
Reported-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: "Daniel P. Berrange" <berrange@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The original code didn't leave enough space for a NUL terminator. These
strings are copied with strcpy() into fixed length buffers in
cgroup_root_from_opts().
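The off-by-one can be illustrated with a small userspace sketch: a string only fits in a fixed-length buffer if its length plus the trailing NUL fits. MODEL_NAME_LEN and both function names are hypothetical and do not correspond to the cgroup code.

```c
#include <string.h>

#define MODEL_NAME_LEN 8	/* hypothetical destination buffer size */

/* A source string fits only if its length plus the trailing NUL fits. */
static int fits_with_nul(const char *src)
{
	return strlen(src) + 1 <= MODEL_NAME_LEN;
}

/* Copy with strcpy() only after the length check has passed. */
static int copy_name(char dst[MODEL_NAME_LEN], const char *src)
{
	if (!fits_with_nul(src))
		return -1;
	strcpy(dst, src);
	return 0;
}
```

A seven character name is accepted, while an eight character name is rejected because its terminator would land one byte past the end of the buffer.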
Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: Serge E. Hallyn <serge@hallyn.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Ben Blum <bblum@andrew.cmu.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>