Граф коммитов

22930 Коммитов

Автор SHA1 Сообщение Дата
Vincent Stehle eb0dc47ab6 genirq: Fix missing irq allocation affinity hint
The new affinity hint argument of __irq_domain_alloc_irqs() is missing in
irq_reserve_ipi(). Add it.

This fixes the following compilation error:

  kernel/irq/ipi.c: In function ‘irq_reserve_ipi’:
  kernel/irq/ipi.c:85:9: error: too few arguments to function ‘__irq_domain_alloc_irqs’
    virq = __irq_domain_alloc_irqs(domain, virq, nr_irqs, NUMA_NO_NODE,
           ^
Fixes: 06ee6d571f ("genirq: Add affinity hint to irq allocation")
Signed-off-by: Vincent Stehlé <vincent.stehle@laposte.net>
Cc: linux-pci@vger.kernel.org
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-19 10:49:47 +02:00
Ben Dooks 775be50626 clockevents: Make clockevents_subsys static
The clockevents_subsys struct is used for sysfs support and
is not declared or used outside the file it is defined in.
Fix the following warning by making it static:

kernel/time/clockevents.c:648:17: warning: symbol 'clockevents_subsys' was not declared. Should it be static?

Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Cc: linux-kernel@lists.codethink.co.uk
Link: http://lkml.kernel.org/r/1466178974-7105-1-git-send-email-ben.dooks@codethink.co.uk
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-19 10:48:06 +02:00
Daniel Borkmann 858d68f102 bpf: bpf_event_entry_gen's alloc needs to be in atomic context
Should have been obvious, only called from bpf() syscall via map_update_elem()
that calls bpf_fd_array_map_update_elem() under RCU read lock and thus this
must also be in GFP_ATOMIC, of course.

Fixes: 3b1efb196e ("bpf, maps: flush own entries on perf map release")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-16 22:03:39 -07:00
Linus Torvalds 8dcf5a80dd Merge branch 'for-4.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fix from Tejun Heo:
 "The optimization for setting unbound worker affinity masks collided
  with recent scheduler changes triggering warning messages.

  This late pull request fixes the bug by removing the optimization"

* 'for-4.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Fix setting affinity of unbound worker threads
2016-07-16 06:36:55 +09:00
Daniel Borkmann 555c8a8623 bpf: avoid stack copy and use skb ctx for event output
This work addresses a couple of issues bpf_skb_event_output()
helper currently has: i) We need two copies instead of just a
single one for the skb data when it should be part of a sample.
The data can be non-linear and thus needs to be extracted via
bpf_skb_load_bytes() helper first, and then copied once again
into the ring buffer slot. ii) Since bpf_skb_load_bytes()
currently needs to be used first, the helper needs to see a
constant size on the passed stack buffer to make sure BPF
verifier can do sanity checks on it during verification time.
Thus, just passing skb->len (or any other non-constant value)
wouldn't work, but changing bpf_skb_load_bytes() is also not
the proper solution, since the two copies are generally still
needed. iii) bpf_skb_load_bytes() is just for rather small
buffers like headers, since they need to sit on the limited
BPF stack anyway. Instead of working around in bpf_skb_load_bytes(),
this work improves the bpf_skb_event_output() helper to address
all 3 at once.

We can make use of the passed in skb context that we have in
the helper anyway, and use some of the reserved flag bits as
a length argument. The helper will use the new __output_custom()
facility from perf side with bpf_skb_copy() as callback helper
to walk and extract the data. It will pass the data for setup
to bpf_event_output(), which generates and pushes the raw record
with an additional frag part. The linear data used in the first
frag of the record serves as programmatically defined meta data
passed along with the appended sample.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-15 14:23:56 -07:00
Daniel Borkmann 8e7a3920ac bpf, perf: split bpf_perf_event_output
Split the bpf_perf_event_output() helper as a preparation into
two parts. The new bpf_perf_event_output() will prepare the raw
record itself and test for unknown flags from BPF trace context,
where the __bpf_perf_event_output() does the core work. The
latter will be reused later on from bpf_event_output() directly.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-15 14:23:56 -07:00
Daniel Borkmann 7e3f977edd perf, events: add non-linear data support for raw records
This patch adds support for non-linear data on raw records. It
extends raw records to have one or multiple fragments that will
be written linearly into the ring slot, where each fragment can
optionally have a custom callback handler to walk and extract
complex, possibly non-linear data.

If a callback handler is provided for a fragment, then the new
__output_custom() will be used instead of __output_copy() for
the perf_output_sample() part. perf_prepare_sample() does all
the size calculation only once, so perf_output_sample() doesn't
need to redo the same work anymore, meaning real_size and padding
will be cached in the raw record. The raw record becomes 32 bytes
in size without holes; to not increase it further and to avoid
doing unnecessary recalculations in fast-path, we can reuse
next pointer of the last fragment, idea here is borrowed from
ZERO_OR_NULL_PTR(), which should keep the perf_output_sample()
path for PERF_SAMPLE_RAW minimal.

This facility is needed for BPF's event output helper as a first
user that will, in a follow-up, add an additional perf_raw_frag
to its perf_raw_record in order to be able to more efficiently
dump skb context after a linear head meta data related to it.
skbs can be non-linear and thus need a custom output function to
dump buffers. Currently, the skb data needs to be copied twice;
with the help of __output_custom() this work only needs to be
done once. Future users could be things like XDP/BPF programs
that work on different context though and would thus also have
a different callback function.

The few users of raw records are adapted to initialize their frag
data from the raw record itself, no change in behavior for them.
The code is based upon a PoC diff provided by Peter Zijlstra [1].

  [1] http://thread.gmane.org/gmane.linux.network/421294

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-15 14:23:56 -07:00
Rafael J. Wysocki 406f992e4a x86 / hibernate: Use hlt_play_dead() when resuming from hibernation
On Intel hardware, native_play_dead() uses mwait_play_dead() by
default and only falls back to the other methods if that fails.
That also happens during resume from hibernation, when the restore
(boot) kernel runs disable_nonboot_cpus() to take all of the CPUs
except for the boot one offline.

However, that is problematic, because the address passed to
__monitor() in mwait_play_dead() is likely to be written to in the
last phase of hibernate image restoration and that causes the "dead"
CPU to start executing instructions again.  Unfortunately, the page
containing the address in that CPU's instruction pointer may not be
valid any more at that point.

First, that page may have been overwritten with image kernel memory
contents already, so the instructions the CPU attempts to execute may
simply be invalid.  Second, the page tables previously used by that
CPU may have been overwritten by image kernel memory contents, so the
address in its instruction pointer is impossible to resolve then.

A report from Varun Koyyalagunta and investigation carried out by
Chen Yu show that the latter sometimes happens in practice.

To prevent it from happening, temporarily change the smp_ops.play_dead
pointer during resume from hibernation so that it points to a special
"play dead" routine which uses hlt_play_dead() and avoids the
inadvertent "revivals" of "dead" CPUs this way.

A slightly unpleasant consequence of this change is that if the
system is hibernated with one or more CPUs offline, it will generally
draw more power after resume than it did before hibernation, because
the physical state entered by CPUs via hlt_play_dead() is higher-power
than the mwait_play_dead() one in the majority of cases.  It is
possible to work around this, but it is unclear how much of a problem
that's going to be in practice, so the workaround will be implemented
later if it turns out to be necessary.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=106371
Reported-by: Varun Koyyalagunta <cpudebug@centtech.com>
Original-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
2016-07-15 22:42:48 +02:00
Eric W. Biederman 726a4994b0 cgroupns: Only allow creation of hierarchies in the initial cgroup namespace
Unprivileged users can't use hierarchies if they create them as they do not
have privilieges to the root directory.

Which means the only thing a hiearchy created by an unprivileged user
is good for is expanding the number of cgroup links in every css_set,
which is a DOS attack.

We could allow hierarchies to be created in namespaces in the initial
user namespace.  Unfortunately there is only a single namespace for
the names of heirarchies, so that is likely to create more confusion
than not.

So do the simple thing and restrict hiearchy creation to the initial
cgroup namespace.

Cc: stable@vger.kernel.org
Fixes: a79a908fd2 ("cgroup: introduce cgroup namespaces")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-07-15 08:04:27 -04:00
Eric W. Biederman eedd0f4cbf cgroupns: Close race between cgroup_post_fork and copy_cgroup_ns
In most code paths involving cgroup migration cgroup_threadgroup_rwsem
is taken.  There are two exceptions:

- remove_tasks_in_empty_cpuset calls cgroup_transfer_tasks
- vhost_attach_cgroups_work calls cgroup_attach_task_all

With cgroup_threadgroup_rwsem held it is guaranteed that cgroup_post_fork
and copy_cgroup_ns will reference the same css_set from the process calling
fork.

Without such an interlock there process after fork could reference one
css_set from it's new cgroup namespace and another css_set from
task->cgroups, which semantically is nonsensical.

Cc: stable@vger.kernel.org
Fixes: a79a908fd2 ("cgroup: introduce cgroup namespaces")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-07-15 07:56:38 -04:00
Eric W. Biederman 7bd8830875 cgroupns: Fix the locking in copy_cgroup_ns
If "clone(CLONE_NEWCGROUP...)" is called it results in a nice lockdep
valid splat.

In __cgroup_proc_write the lock ordering is:
     cgroup_mutex -- through cgroup_kn_lock_live
     cgroup_threadgroup_rwsem

In copy_process the guts of clone the lock ordering is:
     cgroup_threadgroup_rwsem -- through threadgroup_change_begin
     cgroup_mutex -- through copy_namespaces -- copy_cgroup_ns

lockdep reports some a different call chains for the first ordering of
cgroup_mutex and cgroup_threadgroup_rwsem but it is harder to trace.
This is most definitely deadlock potential under the right
circumstances.

Fix this by by skipping the cgroup_mutex and making the locking in
copy_cgroup_ns mirror the locking in cgroup_post_fork which also runs
during fork under the cgroup_threadgroup_rwsem.

Cc: stable@vger.kernel.org
Fixes: a79a908fd2 ("cgroup: introduce cgroup namespaces")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-07-15 07:56:32 -04:00
Thomas Gleixner 4df8374254 rcu: Convert rcutree to hotplug state machine
Straight forward conversion to the state machine. Though the question arises
whether this needs really all these state transitions to work.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160713153337.982013161@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-15 10:41:44 +02:00
Richard Weinberger 31487f8328 smp/cfd: Convert core to hotplug state machine
Install the callbacks via the state machine. They are installed at runtime so
smpcfd_prepare_cpu() needs to be invoked by the boot-CPU.

Signed-off-by: Richard Weinberger <richard@nod.at>
[ Added the dropped CPU dying case back in. ]
Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Davidlohr Bueso <dave@stgolabs>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160713153337.818376366@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-15 10:41:43 +02:00
Sebastian Andrzej Siewior e722d8daaf profile: Convert to hotplug state machine
Install the callbacks via the state machine and let the core invoke
the callbacks on the already online CPUs. A lot of code is removed because
the for-loop is used and create_hash_tables() is removed since its purpose
is covered by the startup / teardown hooks.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160713153337.649867675@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-15 10:41:42 +02:00
Richard Cochran 24f73b9971 timers/core: Convert to hotplug state machine
When tearing down, call timers_dead_cpu() before notify_dead().
There is a hidden dependency between:

 - timers
 - block multiqueue
 - rcutree

If timers_dead_cpu() comes later than blk_mq_queue_reinit_notify()
that latter function causes a RCU stall.

Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160713153337.566790058@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-15 10:41:42 +02:00
Thomas Gleixner 27590dc17b hrtimer: Convert to hotplug state machine
Split out the clockevents callbacks instead of piggybacking them on
hrtimers.

This gets rid of a POST_DEAD user. See commit:

  54e88fad22 ("sched: Make sure timers have migrated before killing the migration_thread")

We just move the callback state to the proper place in the state machine.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160713153337.485419196@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-15 10:41:37 +02:00
Linus Torvalds fa3a9f5744 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "20 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  m32r: fix build warning about putc
  mm: workingset: printk missing log level, use pr_info()
  mm: thp: refix false positive BUG in page_move_anon_rmap()
  mm: rmap: call page_check_address() with sync enabled to avoid racy check
  mm: thp: move pmd check inside ptl for freeze_page()
  vmlinux.lds: account for destructor sections
  gcov: add support for gcc version >= 6
  mm, meminit: ensure node is online before checking whether pages are uninitialised
  mm, meminit: always return a valid node from early_pfn_to_nid
  kasan/quarantine: fix bugs on qlist_move_cache()
  uapi: export lirc.h header
  madvise_free, thp: fix madvise_free_huge_pmd return value after splitting
  Revert "scripts/gdb: add documentation example for radix tree"
  Revert "scripts/gdb: add a Radix Tree Parser"
  scripts/gdb: Perform path expansion to lx-symbol's arguments
  scripts/gdb: add constants.py to .gitignore
  scripts/gdb: rebuild constants.py on dependancy change
  scripts/gdb: silence 'nothing to do' message
  kasan: add newline to messages
  mm, compaction: prevent VM_BUG_ON when terminating freeing scanner
2016-07-15 16:00:18 +09:00
Linus Torvalds d83a4c116c Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
 "Fix a CPU hotplug related corruption of the load average that got
  introduced in this merge window"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Correct off by one bug in load migration calculation
2016-07-15 15:02:49 +09:00
Florian Meier d02038f972 gcov: add support for gcc version >= 6
Link: http://lkml.kernel.org/r/20160701130914.GA23225@styxhp
Signed-off-by: Florian Meier <Florian.Meier@informatik.uni-erlangen.de>
Reviewed-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Tested-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-15 14:54:27 +09:00
Steve Grubb 0b7a0fdb29 audit: fix whitespace in CWD record
Fix the whitespace in the CWD record

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
[PM: fixed subject line]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-07-14 17:47:43 -04:00
Rik van Riel 553bf6bbfd sched/cputime: Drop local_irq_save/restore from irqtime_account_irq()
Paolo pointed out that irqs are already blocked when irqtime_account_irq()
is called. That means there is no reason to call local_irq_save/restore()
again.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1468421405-20056-6-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-14 10:42:35 +02:00
Frederic Weisbecker 0cfdf9a198 sched/cputime: Clean up the old vtime gen irqtime accounting completely
Vtime generic irqtime accounting has been removed but there are a few
remnants to clean up:

* The vtime_accounting_cpu_enabled() check in irq entry was only used
  by CONFIG_VIRT_CPU_ACCOUNTING_GEN. We can safely remove it.

* Without the vtime_accounting_cpu_enabled(), we no longer need to
  have a vtime_common_account_irq_enter() indirect function.

* Move vtime_account_irq_enter() implementation under
  CONFIG_VIRT_CPU_ACCOUNTING_NATIVE which is the last user.

* The vtime_account_user() call was only used on irq entry for
  CONFIG_VIRT_CPU_ACCOUNTING_GEN. We can remove that too.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1468421405-20056-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-14 10:42:35 +02:00
Rik van Riel b58c358405 sched/cputime: Replace VTIME_GEN irq time code with IRQ_TIME_ACCOUNTING code
The CONFIG_VIRT_CPU_ACCOUNTING_GEN irq time tracking code does not
appear to currently work right.

On CPUs without nohz_full=, only tick based irq time sampling is
done, which breaks down when dealing with a nohz_idle CPU.

On firewalls and similar systems, no ticks may happen on a CPU for a
while, and the irq time spent may never get accounted properly. This
can cause issues with capacity planning and power saving, which use
the CPU statistics as inputs in decision making.

Remove the VTIME_GEN vtime irq time code, and replace it with the
IRQ_TIME_ACCOUNTING code, when selected as a config option by the user.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1468421405-20056-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-14 10:42:34 +02:00
Rik van Riel 5743021831 sched/cputime: Count actually elapsed irq & softirq time
Currently, if there was any irq or softirq time during 'ticks'
jiffies, the entire period will be accounted as irq or softirq
time.

This is inaccurate if only a subset of the time was actually spent
handling irqs, and could conceivably mis-count all of the ticks during
a period as irq time, when there was some irq and some softirq time.

This can actually happen when irqtime_account_process_tick is called
from account_idle_ticks, which can pass a larger number of ticks down
all at once.

Fix this by changing irqtime_account_hi_update(), irqtime_account_si_update(),
and steal_account_process_ticks() to work with cputime_t time units, and
return the amount of time spent in each mode.

Rename steal_account_process_ticks() to steal_account_process_time(), to
reflect that time is now accounted in cputime_t, instead of ticks.

Additionally, have irqtime_account_process_tick() take into account how
much time was spent in each of steal, irq, and softirq time.

The latter could help improve the accuracy of cputime
accounting when returning from idle on a NO_HZ_IDLE CPU.

Properly accounting how much time was spent in hardirq and
softirq time will also allow the NO_HZ_FULL code to re-use
these same functions for hardirq and softirq accounting.

Signed-off-by: Rik van Riel <riel@redhat.com>
[ Make nsecs_to_cputime64() actually return cputime64_t. ]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1468421405-20056-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-14 10:42:34 +02:00
Ingo Molnar cefef3a762 Merge branch 'sched/core' into timers/nohz, to avoid conflicts in upcoming patches
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-14 10:37:48 +02:00
Thomas Gleixner 7ee681b252 workqueue: Convert to state machine callbacks
Get rid of the prio ordering of the separate notifiers and use a proper state
callback pair.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160713153335.197083890@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-14 09:34:43 +02:00
Thomas Gleixner 00e16c3d68 perf/core: Convert to hotplug state machine
Actually a nice symmetric startup/teardown pair which fits properly into
the state machine concept. In the long run we should be able to invoke
the startup callback for the boot CPU via the state machine and get
rid of the init function which invokes it on the boot CPU.

Note: This comes actually before the perf hardware callbacks. In the notifier
model the hardware callbacks have a higher priority than the core
callback. But that's solely for CPU offline so that hardware migration of
events happens before the core is notified about the outgoing CPU.

With the symetric state array model we have the following ordering:

 UP:     core -> hardware
 DOWN:   hardware -> core

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Reviewed-by: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160713153333.587514098@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-14 09:34:31 +02:00
Thomas Gleixner 6a4e24518c cpu/hotplug: Handle early registration gracefully
We switched the hotplug machinery to smpboot threads. Early registration of
hotplug callbacks, i.e. from do_pre_smp_initcalls(), happens before the
threads are initialized. Instead of moving the thread init, we simply handle
it in the hotplug code itself and invoke the function directly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160713153332.896450738@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-14 09:34:25 +02:00
Greg Kroah-Hartman 6c71ee3b61 Third set of IIO new device support, features and cleanups for the 4.8 cycle.
New core features
 - Selection of the clock source for IIO timestamps.  This is done per device
   as it makes little sense to have events in one timebase and data timestamped
   on another.  Biggest reason for this is that we currently use a clock
   source which is non monotonic which can result in 'interesting' data sets.
   (Includes export for get_monotonic_corse64 which Thomas Gleixner didn't mind
    in an earlier version.)
 - MAINTAINERS add the git tree to the list for IIO.
 
 New device support + a kind of indirect staging graduation.
 * Broadcom iproc-static-adc
   - new driver
 * mcp4531
   - support for MCP454x, MCP456x, MCP464x and MCP466x potentiometers
 * mpu6050
   - support the IC20608 6 axis motion tracking device
 * st-sensors
   - support the lis3l02dq + drop the lis3l02dq driver from staging.
   The general purpose driver is missing event support, but good to get
   rid of this driver which was rather long in the tooth.
 
 New driver features
 * ak8975
   - Add vid regulator support and refactor handling in general.
   - Allow a delay after enabling regulators.
   - Runtime and system PM.
 * bmg160
   - filter frequency control support.
 * bmp280
   - SPI device support.
   - EOC interrupt support for the BMP085
   - power management support.
   - supply regulator support.
   - reset gpio support
   - dt bindings for reset gpio and regulators.
   - of table to support device tree registration
 * max1363
   - Device tree bindings.
 * mcp4531
   - Device tree bindings.
 * st-pressure
   - temperature channels as part of triggered buffer (previously not due
   probably to alignment issues - see below).
   - lps22hb open drain interrupt support.
   - lps22hb temperature channel support
 
 Cleanups and reworkings.
 * numerous ADC drivers
   - ensure the iio_dev->dev.of_node is set to the parent dev.of_node so
   as to allow client bindings to find the device.
 * ak8975
   - Fix incorrect handling of missing regulator
   - make sure power is down and remove.
 * bmp280
   - read the calibration data only once as it doesn't change.
 * isl29125
   - Use a few macros to make code a touch more readable.
 * mma8452
   - fix a memory leak on error.
   - drop an unecessary bit of return value handling.
 * potentiometer kconfig
   - typo fix.
 * st-pressure
   - drop some uninformative default assignments of elements of the channel
   array structure (aids readability).
 * st-sensors
   - Harden interrupt handling considerably.  These are actually all using
   level interrupts, but at least two known boards have them wired to
   edge only interrupt chips.  Hence a slightly interesting bit of handling
   is needed in which we first allow for the easy option (level triggered) and
   secondly check the status registers before reenabling edge interrupts and
   fall back to a tight loop in the thread until we successfully clear the
   interrupt.  No harm is done if we never succeed in doing so.  It's an odd
   patch that has been through a lot of revisions to reach a consensus on how
   to handle what is basically broken hardware (which the previous defaults
   allowed to kind of work).
   - Fix alignment to defined storagebytes boundaries.
   - Ensure alignment of power of 2 byte boundaries.  This has always in theory
   been part of the ABI of IIO, but we missed a few that snuck in that need
   fixing.  The effect was minor as they were only followed by timestamp
   channels which were correctly aligned,
   - Add some docs to explain the gain calculations.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIuBAABCAAYBQJXfBqnERxqaWMyM0BrZXJuZWwub3JnAAoJEFSFNJnE9BaIqjwP
 /0OJbr8kIa1i6+iCqCRCPCixdymd6k9wvjDaKSQoDeamen+8iKOLZNhXJJjOX8hd
 eCRMrCJbvY96Bl2Ll51TCEBb8R1xppCwwYIYylKhF9CL6N2ndapzWY0G4XZb6pc0
 e1JIa6uxynAAEsfplBskk4Ytf5PPHDOWER5WsTmxlZcTTAL9gLxIlii2Du0AmeN/
 tANVzwuvK07i5HHuZfYV2h2+OWDSlm4Y5rvE7t8keWpp6wnZ0XtiIw1WjkpR1OY7
 KiKGKRJMomFlp51hP9IKqc20Dweiaf3lHS7BDggvkB11VxyajQTcjvogxQ0BSPUv
 7PTHHlk8txgEUMqrDWP8x0TL97iNt3hiOZ0/rI3IZdFLC8pnibewnB+uHEGCH3tv
 bqToPtpJHjsIiGlCGVxvt8BRgqT5Qq7JT65hYS6774uFcQiPEvPDI44BDqUxaDUf
 /1WFM23VB4KJpx8JnL+nC8iu6DBnVPDWDKAsjGgc+ljnz3VRcSxWz5P0yMFZRMA2
 mbLiG2yiD4oD/LcI8FeZh9X50Irg09ElAWu07VRymrYMRfCYLXO07o5nZJ0bOqOB
 R+1MToYaHz2g6jJ+KGVC0Ul5EuULzymqH0CMbdjWnaD9AaoPuOKkNfUVBkzRK0t/
 TO/wLHm/qNbk+zGZHQFU15mH1Nn9leEJ/uCdnGqkRo7i
 =FxNN
 -----END PGP SIGNATURE-----

Merge tag 'iio-for-4.8c' of git://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio into staging-next

Jonathan writes:

Third set of IIO new device support, features and cleanups for the 4.8 cycle.

New core features
- Selection of the clock source for IIO timestamps.  This is done per device
  as it makes little sense to have events in one timebase and data timestamped
  on another.  Biggest reason for this is that we currently use a clock
  source which is non monotonic which can result in 'interesting' data sets.
  (Includes export for get_monotonic_corse64 which Thomas Gleixner didn't mind
   in an earlier version.)
- MAINTAINERS add the git tree to the list for IIO.

New device support + a kind of indirect staging graduation.
* Broadcom iproc-static-adc
  - new driver
* mcp4531
  - support for MCP454x, MCP456x, MCP464x and MCP466x potentiometers
* mpu6050
  - support the IC20608 6 axis motion tracking device
* st-sensors
  - support the lis3l02dq + drop the lis3l02dq driver from staging.
  The general purpose driver is missing event support, but good to get
  rid of this driver which was rather long in the tooth.

New driver features
* ak8975
  - Add vid regulator support and refactor handling in general.
  - Allow a delay after enabling regulators.
  - Runtime and system PM.
* bmg160
  - filter frequency control support.
* bmp280
  - SPI device support.
  - EOC interrupt support for the BMP085
  - power management support.
  - supply regulator support.
  - reset gpio support
  - dt bindings for reset gpio and regulators.
  - of table to support device tree registration
* max1363
  - Device tree bindings.
* mcp4531
  - Device tree bindings.
* st-pressure
  - temperature channels as part of triggered buffer (previously not due
  probably to alignment issues - see below).
  - lps22hb open drain interrupt support.
  - lps22hb temperature channel support

Cleanups and reworkings.
* numerous ADC drivers
  - ensure the iio_dev->dev.of_node is set to the parent dev.of_node so
  as to allow client bindings to find the device.
* ak8975
  - Fix incorrect handling of missing regulator
  - make sure power is down and remove.
* bmp280
  - read the calibration data only once as it doesn't change.
* isl29125
  - Use a few macros to make code a touch more readable.
* mma8452
  - fix a memory leak on error.
  - drop an unecessary bit of return value handling.
* potentiometer kconfig
  - typo fix.
* st-pressure
  - drop some uninformative default assignments of elements of the channel
  array structure (aids readability).
* st-sensors
  - Harden interrupt handling considerably.  These are actually all using
  level interrupts, but at least two known boards have them wired to
  edge only interrupt chips.  Hence a slightly interesting bit of handling
  is needed in which we first allow for the easy option (level triggered) and
  secondly check the status registers before reenabling edge interrupts and
  fall back to a tight loop in the thread until we successfully clear the
  interrupt.  No harm is done if we never succeed in doing so.  It's an odd
  patch that has been through a lot of revisions to reach a consensus on how
  to handle what is basically broken hardware (which the previous defaults
  allowed to kind of work).
  - Fix alignment to defined storagebytes boundaries.
  - Ensure alignment of power of 2 byte boundaries.  This has always in theory
  been part of the ABI of IIO, but we missed a few that snuck in that need
  fixing.  The effect was minor as they were only followed by timestamp
  channels which were correctly aligned,
  - Add some docs to explain the gain calculations.
2016-07-14 12:05:29 +09:00
Linus Torvalds f97d10454e Merge branches 'perf-urgent-for-linus' and 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf and timer fixes from Ingo Molnar:
 "A fix for a posix CPU timers bug, and a perf printk message fix"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Fix bogus kernel printk, again

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  posix_cpu_timer: Exit early when process has been reaped
2016-07-14 05:44:47 +09:00
Thomas Gleixner e877bde234 Merge branch 'core/urgent' into smp/hotplug to pick up dependencies 2016-07-13 17:03:30 +02:00
Thomas Gleixner e1c4cde62b Merge branch 'core/rcu' into smp/hotplug to pick up dependencies 2016-07-13 17:03:17 +02:00
Thomas Gleixner 54f5449677 Merge branch 'timers/core' into smp/hotplug to pick up dependencies 2016-07-13 17:01:51 +02:00
Thomas Gleixner d60585c576 sched/core: Correct off by one bug in load migration calculation
The move of calc_load_migrate() from CPU_DEAD to CPU_DYING did not take into
account that the function is now called from a thread running on the outgoing
CPU. As a result a cpu unplug leakes a load of 1 into the global load
accounting mechanism.

Fix it by adjusting for the currently running thread which calls
calc_load_migrate().

Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Cc: rt@linutronix.de
Cc: shreyas@linux.vnet.ibm.com
Fixes: e9cd8fa4fcfd: ("sched/migration: Move calc_load_migrate() into CPU_DYING")
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1607121744350.4083@nanos
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-13 14:58:20 +02:00
Thomas Gleixner a7c734140a cpu/hotplug: Keep enough storage space if SMP=n to avoid array out of bounds scribble
Xiaolong Ye reported lock debug warnings triggered by the following commit:

  8de4a0066106 ("perf/x86: Convert the core to the hotplug state machine")

The bug is the following: the cpuhp_bp_states[] array is cut short when
CONFIG_SMP=n, but the dynamically registered callbacks are stored nevertheless
and happily scribble outside of the array bounds...

We need to store them in case that the state is unregistered so we can invoke
the teardown function. That's independent of CONFIG_SMP. Make sure the array
is large enough.

Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Adam Borowski <kilobyte@angband.pl>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Stephane Eranian <eranian@google.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: lkp@01.org
Cc: stable@vger.kernel.org
Cc: tipbuild@zytor.com
Fixes: cff7d378d3 "cpu/hotplug: Convert to a state machine for the control processor"
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1607122144560.4083@nanos
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-13 09:29:39 +02:00
Paul Gortmaker a536a6e13e bpf: make inode code explicitly non-modular
The Kconfig currently controlling compilation of this code is:

init/Kconfig:config BPF_SYSCALL
init/Kconfig:   bool "Enable bpf() system call"

...meaning that it currently is not being built as a module by anyone.

Lets remove the couple traces of modular infrastructure use, so that
when reading the driver there is no doubt it is builtin-only.

Note that MODULE_ALIAS is a no-op for non-modular code.

We replace module.h with init.h since the file does use __init.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: netdev@vger.kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11 13:52:43 -07:00
Alexander Popov a1b7b1a57b irqdomain: Fix irq_domain_alloc_irqs_recursive() error handling
If an irq_domain is auto-recursive and irq_domain_alloc_irqs_recursive()
for its parent has returned an error, then do return and avoid calling
irq_domain_free_irqs_recursive() uselessly, because:
- if domain->ops->alloc() had failed for an auto-recursive irq_domain,
   then irq_domain_free_irqs_recursive() had already been called;
- if domain->ops->alloc() had failed for a not auto-recursive irq_domain,
   then there is nothing to free at all.

Signed-off-by: Alexander Popov <alex.popov@linux.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Link: http://lkml.kernel.org/r/1467505448-2850-1-git-send-email-alex.popov@linux.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-11 17:23:48 +02:00
Alexey Dobriyan 2c13ce8f6b posix_cpu_timer: Exit early when process has been reaped
Variable "now" seems to be genuinely used unintialized
if branch

	if (CPUCLOCK_PERTHREAD(timer->it_clock)) {

is not taken and branch

	if (unlikely(sighand == NULL)) {

is taken. In this case the process has been reaped and the timer is marked as
disarmed anyway. So none of the postprocessing of the sample is
required. Return right away.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20160707223911.GA26483@p183.telecom.by
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-11 17:20:12 +02:00
Ingo Molnar 44530d588e Revert "perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86"
This reverts commit 2c95afc1e8.

Stephane reported the following regression:

 > Since Andi added:
 >
 > commit 2c95afc1e8
 > Author: Andi Kleen <ak@linux.intel.com>
 > Date:   Thu Jun 9 06:14:38 2016 -0700
 >
 >    perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86
 >
 > $ perf stat -e ref-cycles ls
 >   <not counted> ....
 >
 > fails systematically because the ref-cycles is now used by the
 > watchdog and given this is a system-wide pinned event, it monopolizes
 > the fixed counter 2 which is the only counter able to measure this event.

Since the next merge window is near, fix the regression for now
by reverting the commit.

Reported-by: Stephane Eranian <eranian@google.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-10 20:58:36 +02:00
Daniel Bristot de Oliveira 748c7201e6 sched/core: Panic on scheduling while atomic bugs if kernel.panic_on_warn is set
Currently, a schedule while atomic error prints the stack trace to the
kernel log and the system continue running.

Although it is possible to collect the kernel log messages and analyze
it, often more information are needed. Furthermore, keep the system
running is not always the best choice. For example, when the preempt
count underflows the system will not stop to complain about scheduling
while atomic, so the kernel log can wrap around overwriting the first
stack trace, tuning the analysis even more challenging.

This patch uses the kernel.panic_on_warn sysctl to help out on these
more complex situations.

When kernel.panic_on_warn is set to 1, the kernel will panic() in the
schedule while atomic detection.

The default value of the sysctl is 0, maintaining the current behavior.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Reviewed-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/e8f7b80f353aa22c63bd8557208163989af8493d.1464983675.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-10 20:17:27 +02:00
Rafael J. Wysocki 4c0b6c10fb PM / hibernate: Image data protection during restoration
Make it possible to protect all pages holding image data during
hibernate image restoration by setting them read-only (so as to
catch attempts to write to those pages after image data have been
stored in them).

This adds overhead to image restoration code (it may cause large
page mappings to be split as a result of page flags changes) and
the errors it protects against should never happen in theory, so
the feature is only active after passing hibernate=protect_image
to the command line of the restore kernel.

Also it only is built if CONFIG_DEBUG_RODATA is set.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-07-10 02:12:10 +02:00
Rafael J. Wysocki d5f32af310 PM / hibernate: Add missing braces in __register_nosave_region()
One branch of an if/else statement in __register_nosave_region() is
formatted against the kernel coding style which causes the code to
look slightly odd.  To fix that, add missing braces to it.

No functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-07-10 01:37:35 +02:00
Rafael J. Wysocki ef96f639ea PM / hibernate: Clean up comments in snapshot.c
Many comments in kernel/power/snapshot.c do not follow the general
comment formatting rules.  They look odd, some of them are outdated
too, some are hard to parse and generally difficult to understand.

Clean them up to make them easier to comprehend.

No functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-07-10 01:37:26 +02:00
Rafael J. Wysocki efd5a85242 PM / hibernate: Clean up function headers in snapshot.c
The formatting of some function headers in kernel/power/snapshot.c
is not consistent with the general kernel coding style and with the
formatting of some other function headers in the same file.

Make all of them follow the same formatting convention.

No functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-07-10 01:37:20 +02:00
Rafael J. Wysocki 2f88e41a22 PM / hibernate: Add missing braces in hibernate_setup()
Make hibernate_setup() follow the coding style more closely by adding
some missing braces to the if () statement in it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-07-10 01:37:13 +02:00
Zhao Lei 277a13e4f0 sched/cpuacct: Introduce cpuacct.usage_all to show all CPU stats together
In current code, we can get cpuacct data from several files,
but each file has various limitations.

For example:

 - We can get CPU usage in user and kernel mode via cpuacct.stat,
   but we can't get detailed data about each CPU.

 - We can get each CPU's kernel mode usage in cpuacct.usage_percpu_sys,
   but we can't get user mode usage data at the same time.

This patch introduces cpuacct.usage_all, to show all detailed CPU
accounting data together:

 # cat cpuacct.usage_all
 cpu user system
 0 3809760299 5807968992
 1 3250329855 454612211
 ..

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/7744460969edd7caaf0e903592ee52353ed9bdd6.1466415271.git.zhaolei@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-09 13:56:15 +02:00
Zhao Lei 8e546bfafb sched/cpuacct: Use loop to consolidate code in cpuacct_stats_show()
In cpuacct_stats_show() we currently we have copies of similar code,
for each cpustat(system/user) variant.

Use a loop instead to consolidate the code. This will also work better
if we extend the CPUACCT_STAT_NSTATS type.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/b0597d4224655e9f333f1a6224ed9654c7d7d36a.1466415271.git.zhaolei@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-09 13:56:15 +02:00
Zhao Lei 9acacc2ac5 sched/cpuacct: Merge cpuacct_usage_index and cpuacct_stat_index enums
These two types have similar function, no need to separate them.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/436748885270d64363c7dc67167507d486c2057a.1466415271.git.zhaolei@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-09 13:56:15 +02:00
Alexei Starovoitov 606274c5ab bpf: introduce bpf_get_current_task() helper
over time there were multiple requests to access different data
structures and fields of task_struct current, so finally add
the helper to access 'current' as-is. Tracing bpf programs will do
the rest of walking the pointers via bpf_probe_read().
Note that current can be null and bpf program has to deal it with,
but even dumb passing null into bpf_probe_read() is still safe.

Suggested-by: Brendan Gregg <brendan.d.gregg@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-09 00:00:16 -04:00
Rafael J. Wysocki 63f9ccb895 Merge back earlier suspend/hibernation changes for v4.8. 2016-07-08 23:14:17 +02:00
Linus Torvalds 369da7fc6d Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Two load-balancing fixes for cgroups-intense workloads"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix calc_cfs_shares() fixed point arithmetics width confusion
  sched/fair: Fix effective_load() to consistently use smoothed load
2016-07-08 09:04:34 -07:00
Ingo Molnar 9e7f7f5425 Merge branch 'x86/mm' into x86/boot, to pick up dependencies
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-08 17:27:47 +02:00
Ingo Molnar 4b4b20852d Merge branch 'timers/fast-wheel' into timers/core 2016-07-07 10:35:28 +02:00
Anna-Maria Gleixner f00c0afdfa timers: Implement optimization for same expiry time in mod_timer()
The existing optimization for same expiry time in mod_timer() checks whether
the timer expiry time is the same as the new requested expiry time. In the old
timer wheel implementation this does not take the slack batching into account,
neither does the new implementation evaluate whether the new expiry time will
requeue the timer to the same bucket.

To optimize that, we can calculate the resulting bucket and check if the new
expiry time is different from the current expiry time. This calculation
happens outside the base lock held region. If the resulting bucket is the same
we can avoid taking the base lock and requeueing the timer.

If the timer needs to be requeued then we have to check under the base lock
whether the base time has changed between the lockless calculation and taking
the lock. If it has changed we need to recalculate under the lock.

This optimization takes effect for timers which are enqueued into the less
granular wheel levels (1 and above). With a simple test case the functionality
has been verified:

            Before        After
 Match:       5.5%        86.6%
 Requeue:    94.5%        13.4%
 Recalc:                  <0.01%

In the non optimized case the timer is requeued in 94.5% of the cases. With
the index optimization in place the requeue rate drops to 13.4%. The case
where the lockless index calculation has to be redone is less than 0.01%.

With a real world test case (networking) we observed the following changes:

            Before        After
 Match:      97.8%        99.7%
 Requeue:     2.2%         0.3%
 Recalc:                  <0.001%

That means two percent fewer lock/requeue/unlock operations done in one of
the hot path use cases of timers.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094342.778527749@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:12 +02:00
Anna-Maria Gleixner ffdf047728 timers: Split out index calculation
For further optimizations we need to seperate index calculation
from queueing. No functional change.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094342.691159619@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:12 +02:00
Thomas Gleixner 4e85876a9d timers: Only wake softirq if necessary
With the wheel forwading in place and with the HZ=1000 4ms folding we can
avoid running the softirq at all.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094342.607650550@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:11 +02:00
Thomas Gleixner a683f390b9 timers: Forward the wheel clock whenever possible
The wheel clock is stale when a CPU goes into a long idle sleep. This has the
side effect that timers which are queued end up in the outer wheel levels.
That results in coarser granularity.

To solve this, we keep track of the idle state and forward the wheel clock
whenever possible.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094342.512039360@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:11 +02:00
Thomas Gleixner ff00673292 timers/nohz: Remove pointless tick_nohz_kick_tick() function
This was a failed attempt to optimize the timer expiry in idle, which was
disabled and never revisited. Remove the cruft.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094342.431073782@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:10 +02:00
Anna-Maria Gleixner 236968383c timers: Optimize collect_expired_timers() for NOHZ
After a NOHZ idle sleep the timer wheel must be forwarded to current jiffies.
There might be expired timers so the current code loops and checks the expired
buckets for timers. This can take quite some time for long NOHZ idle periods.

The pending bitmask in the timer base allows us to do a quick search for the
next expiring timer and therefore a fast forward of the base time which
prevents pointless long lasting loops.

For a 3 seconds idle sleep this reduces the catchup time from ~1ms to 5us.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094342.351296290@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:10 +02:00
Anna-Maria Gleixner 73420fea80 timers: Move __run_timers() function
Move __run_timers() below __next_timer_interrupt() and next_pending_bucket()
in preparation for __run_timers() NOHZ optimization.

No functional change.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094342.271872665@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:09 +02:00
Thomas Gleixner 53bf837b78 timers: Remove set_timer_slack() leftovers
We now have implicit batching in the timer wheel. The slack API is no longer
used, so remove it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Andrew F. Davis <afd@ti.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jaehoon Chung <jh80.chung@samsung.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathias Nyman <mathias.nyman@intel.com>
Cc: Pali Rohár <pali.rohar@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mmc@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-usb@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094342.189813118@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:09 +02:00
Thomas Gleixner 500462a9de timers: Switch to a non-cascading wheel
The current timer wheel has some drawbacks:

1) Cascading:

   Cascading can be an unbound operation and is completely pointless in most
   cases because the vast majority of the timer wheel timers are canceled or
   rearmed before expiration. (They are used as timeout safeguards, not as
   real timers to measure time.)

2) No fast lookup of the next expiring timer:

   In NOHZ scenarios the first timer soft interrupt after a long NOHZ period
   must fast forward the base time to the current value of jiffies. As we
   have no way to find the next expiring timer fast, the code loops linearly
   and increments the base time one by one and checks for expired timers
   in each step. This causes unbound overhead spikes exactly in the moment
   when we should wake up as fast as possible.

After a thorough analysis of real world data gathered on laptops,
workstations, webservers and other machines (thanks Chris!) I came to the
conclusion that the current 'classic' timer wheel implementation can be
modified to address the above issues.

The vast majority of timer wheel timers is canceled or rearmed before
expiry. Most of them are timeouts for networking and other I/O tasks. The
nature of timeouts is to catch the exception from normal operation (TCP ack
timed out, disk does not respond, etc.). For these kinds of timeouts the
accuracy of the timeout is not really a concern. Timeouts are very often
approximate worst-case values and in case the timeout fires, we already
waited for a long time and performance is down the drain already.

The few timers which actually expire can be split into two categories:

 1) Short expiry times which expect halfways accurate expiry

 2) Long term expiry times are inaccurate today already due to the
    batching which is done for NOHZ automatically and also via the
    set_timer_slack() API.

So for long term expiry timers we can avoid the cascading property and just
leave them in the less granular outer wheels until expiry or
cancelation. Timers which are armed with a timeout larger than the wheel
capacity are no longer cascaded. We expire them with the longest possible
timeout (6+ days). We have not observed such timeouts in our data collection,
but at least we handle them, applying the rule of the least surprise.

To avoid extending the wheel levels for HZ=1000 so we can accomodate the
longest observed timeouts (5 days in the network conntrack code) we reduce the
first level granularity on HZ=1000 to 4ms, which effectively is the same as
the HZ=250 behaviour. From our data analysis there is nothing which relies on
that 1ms granularity and as a side effect we get better batching and timer
locality for the networking code as well.

Contrary to the classic wheel the granularity of the next wheel is not the
capacity of the first wheel. The granularities of the wheels are in the
currently chosen setting 8 times the granularity of the previous wheel.

So for HZ=250 we end up with the following granularity levels:

 Level Offset   Granularity                  Range
     0      0          4 ms                 0 ms -        252 ms
     1     64         32 ms               256 ms -       2044 ms (256ms - ~2s)
     2    128        256 ms              2048 ms -      16380 ms (~2s   - ~16s)
     3    192       2048 ms (~2s)       16384 ms -     131068 ms (~16s  - ~2m)
     4    256      16384 ms (~16s)     131072 ms -    1048572 ms (~2m   - ~17m)
     5    320     131072 ms (~2m)     1048576 ms -    8388604 ms (~17m  - ~2h)
     6    384    1048576 ms (~17m)    8388608 ms -   67108863 ms (~2h   - ~18h)
     7    448    8388608 ms (~2h)    67108864 ms -  536870911 ms (~18h  - ~6d)

That's a worst case inaccuracy of 12.5% for the timers which are queued at the
beginning of a level.

So the new wheel concept addresses the old issues:

1) Cascading is avoided completely

2) By keeping the timers in the bucket until expiry/cancelation we can track
   the buckets which have timers enqueued in a bucket bitmap and therefore can
   look up the next expiring timer very fast and O(1).

A further benefit of the concept is that the slack calculation which is done
on every timer start is no longer necessary because the granularity levels
provide natural batching already.

Our extensive testing with various loads did not show any performance
degradation vs. the current wheel implementation.

This patch does not address the 'fast lookup' issue as we wanted to make sure
that there is no regression introduced by the wheel redesign. The
optimizations are in follow up patches.

This patch contains fixes from Anna-Maria Gleixner and Richard Cochran.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094342.108621834@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:09 +02:00
Thomas Gleixner 494af3ed78 timers: Give a few structs and members proper names
Some of the names in the internal implementation of the timer code
are not longer correct and others are simply too long to type.

Clean it up before we switch the wheel implementation over to
the new scheme.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094341.948752516@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:08 +02:00
Thomas Gleixner 2b1ecc3d1a signals: Use hrtimer for sigtimedwait()
We've converted most timeout related syscalls to hrtimers, but
sigtimedwait() did not get this treatment.

Convert it so we get a reasonable accuracy and remove the
user space exposure to the timer wheel properties.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Cyril Hrubis <chrubis@suse.cz>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094341.787164909@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:07 +02:00
Thomas Gleixner 177ec0a0a5 timers: Remove the deprecated mod_timer_pinned() API
We switched all users to initialize the timers as pinned and call
mod_timer(). Remove the now unused timer API function.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094341.706205231@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:06 +02:00
Thomas Gleixner e675447bda timers: Make 'pinned' a timer property
We want to move the timer migration logic from a 'push' to a 'pull' model.

Under the current 'push' model pinned timers are handled via
a runtime API variant: mod_timer_pinned().

The 'pull' model requires us to store the pinned attribute of a timer
in the timer_list structure itself, as a new TIMER_PINNED bit in
timer->flags.

This flag must be set at initialization time and the timer APIs
recognize the flag.

This patch:

 - Implements the new flag and associated new-style initialization
   methods

 - makes mod_timer() recognize new-style pinned timers,

 - and adds some migration helper facility to allow
   step by step conversion of old-style to new-style
   pinned timers.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094341.049338558@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:25:13 +02:00
Ingo Molnar 36e91aa262 Merge branch 'locking/arch-atomic' into locking/core, because the topic is ready
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 09:12:02 +02:00
Wei Yongjun 885885f6b8 locking/static_keys: Fix non static symbol Sparse warning
Fix the following sparse warnings:

  kernel/jump_label.c:473:23: warning:
   symbol 'jump_label_module_nb' was not declared. Should it be static?

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1466183980-8903-1-git-send-email-weiyj_lk@163.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 09:06:46 +02:00
Ingo Molnar 3ebe3bd8fb Merge branch 'perf/urgent' into perf/core, to pick up fixes before merging new changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 08:58:23 +02:00
Mark Rutland 2c81a64770 perf/core: Fix pmu::filter_match for SW-led groups
The following commit:

  66eb579e66 ("perf: allow for PMU-specific event filtering")

added the pmu::filter_match() callback. This was intended to
avoid HW constraints on events from resulting in extremely
pessimistic scheduling.

However, pmu::filter_match() is only called for the leader of each event
group. When the leader is a SW event, we do not filter the groups, and
may fail at pmu::add() time, and when this happens we'll give up on
scheduling any event groups later in the list until they are rotated
ahead of the failing group.

This can result in extremely sub-optimal event scheduling behaviour,
e.g. if running the following on a big.LITTLE platform:

$ taskset -c 0 ./perf stat \
 -e 'a57{context-switches,armv8_cortex_a57/config=0x11/}' \
 -e 'a53{context-switches,armv8_cortex_a53/config=0x11/}' \
 ls

     <not counted>      context-switches                                              (0.00%)
     <not counted>      armv8_cortex_a57/config=0x11/                                 (0.00%)
                24      context-switches                                              (37.36%)
          57589154      armv8_cortex_a53/config=0x11/                                 (37.36%)

Here the 'a53' event group was always eligible to be scheduled, but
the 'a57' group never eligible to be scheduled, as the task was always
affine to a Cortex-A53 CPU. The SW (group leader) event in the 'a57'
group was eligible, but the HW event failed at pmu::add() time,
resulting in ctx_flexible_sched_in giving up on scheduling further
groups with HW events.

One way of avoiding this is to check pmu::filter_match() on siblings
as well as the group leader. If any of these fail their
pmu::filter_match() call, we must skip the entire group before
attempting to add any events.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Fixes: 66eb579e66 ("perf: allow for PMU-specific event filtering")
Link: http://lkml.kernel.org/r/1465917041-15339-1-git-send-email-mark.rutland@arm.com
[ Small readability edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 08:57:57 +02:00
Juergen Gross ecb23dc6f2 xen: add steal_clock support on x86
The pv_time_ops structure contains a function pointer for the
"steal_clock" functionality used only by KVM and Xen on ARM. Xen on x86
uses its own mechanism to account for the "stolen" time a thread wasn't
able to run due to hypervisor scheduling.

Add support in Xen arch independent time handling for this feature by
moving it out of the arm arch into drivers/xen and remove the x86 Xen
hack.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2016-07-06 10:34:48 +01:00
Namhyung Kim a4a551b8f1 ftrace: Reduce size of function graph entries
Currently ftrace_graph_ent{,_entry} and ftrace_graph_ret{,_entry} struct
can have padding bytes at the end due to alignment in 64-bit data type.
As these data are recorded so frequently, those paddings waste
non-negligible space.  As the ring buffer maintains alignment properly
for each architecture, just to remove the extra padding using 'packed'
attribute.

  ftrace_graph_ent_entry:  24 -> 20
  ftrace_graph_ret_entry:  48 -> 44

Also I moved the 'overrun' field in struct ftrace_graph_ret to minimize
the padding in the middle.

Tested on x86_64 only.

Link: http://lkml.kernel.org/r/1467197808-13578-1-git-send-email-namhyung@kernel.org

Cc: Ingo Molnar <mingo@kernel.org>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-07-05 17:28:30 -04:00
Tom Zanussi 7ad8fb61c4 tracing: Have HIST_TRIGGERS select TRACING
The kbuild test robot reported a compile error if HIST_TRIGGERS was
enabled but nothing else that selected TRACING was configured in.

HIST_TRIGGERS should directly select it and not rely on anything else
to do it.

Link: http://lkml.kernel.org/r/57791866.8080505@linux.intel.com

Reported-by: kbuild test robot <fennguang.wu@intel.com>
Fixes: 7ef224d1d0 ("tracing: Add 'hist' event trigger command")
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-07-05 15:49:01 -04:00
Wei Yongjun 67f20b0845 tracing: Using for_each_set_bit() to simplify trace_pid_write()
Using for_each_set_bit() to simplify the code.

Link: http://lkml.kernel.org/r/1467645004-11169-1-git-send-email-weiyj_lk@163.com

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-07-05 11:22:40 -04:00
Jisheng Zhang 5130213721 tick/broadcast-hrtimer: Set name of the ce_broadcast_hrtimer
This is to avoid the "null" name when we either

~ # cat /sys/devices/system/clockevents/broadcast/current_device
(null)

or

~ # cat /proc/timer_list
...
Tick Device: mode:     1
Broadcast device
Clock Event Device: (null)
...

Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/1467709071-3667-1-git-send-email-jszhang@marvell.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-05 17:02:19 +02:00
Steven Rostedt (Red Hat) 501c237525 ftrace: Move toplevel init out of ftrace_init_tracefs()
Commit 345ddcc882 ("ftrace: Have set_ftrace_pid use the bitmap like events
do") placed ftrace_init_tracefs into the instance creation, and encapsulated
the top level updating with an if conditional, as the top level only gets
updated at boot up. Unfortunately, this triggers section mismatch errors as
the init functions are called from a function that can be called later, and
the section mismatch logic is unaware of the if conditional that would
prevent it from happening at run time.

To make everyone happy, create a separate ftrace_init_tracefs_toplevel()
routine that only gets called by init functions, and this will be what calls
other init functions for the toplevel directory.

Link: http://lkml.kernel.org/r/20160704102139.19cbc0d9@gandalf.local.home

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 345ddcc882 ("ftrace: Have set_ftrace_pid use the bitmap like events do")
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-07-05 10:47:03 -04:00
Thomas Gleixner 4364e1a29b genirq/msi: Fix broken debug output
virq is not required to be the same for all msi descs. Use the base irq number
from the desc in the debug printk.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-04 15:32:25 +02:00
Rafael J. Wysocki 5d1191ab6c Merge back earlier cpufreq material for v4.8. 2016-07-04 13:21:43 +02:00
Thomas Gleixner 8658be133b Merge branch 'irq/for-block' into irq/core
Pull the irq affinity managing code which is in a seperate branch for block
developers to pull.
2016-07-04 12:26:05 +02:00
Christoph Hellwig 5e385a6ef3 genirq: Add a helper to spread an affinity mask for MSI/MSI-X vectors
This is lifted from the blk-mq code and adopted to use the affinity mask
concept just introduced in the irq handling code.  It tries to keep the
algorithm the same as the one current used by blk-mq, but improvements
like assining vectors on a per-node basis instead of just per sibling
are possible with this simple move and refactoring.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: linux-block@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: axboe@fb.com
Cc: agordeev@redhat.com
Link: http://lkml.kernel.org/r/1467621574-8277-7-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-04 12:25:14 +02:00
Thomas Gleixner 0972fa57f5 genirq/msi: Make use of affinity aware allocations
Allow the MSI code to provide affinity hints per MSI descriptor.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: linux-block@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: axboe@fb.com
Cc: agordeev@redhat.com
Link: http://lkml.kernel.org/r/1467621574-8277-6-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-04 12:25:14 +02:00
Thomas Gleixner 45ddcecbfa genirq: Use affinity hint in irqdesc allocation
Use the affinity hint in the irqdesc allocator. The hint is used to determine
the node for the allocation and to set the affinity of the interrupt.

If multiple interrupts are allocated (multi-MSI) then the allocator iterates
over the cpumask and for each set cpu it allocates on their node and sets the
initial affinity to that cpu.

If a single interrupt is allocated (MSI-X) then the allocator uses the first
cpu in the mask to compute the allocation node and uses the mask for the
initial affinity setting.

Interrupts set up this way are marked with the AFFINITY_MANAGED flag to
prevent userspace from messing with their affinity settings.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: linux-block@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: axboe@fb.com
Cc: agordeev@redhat.com
Link: http://lkml.kernel.org/r/1467621574-8277-5-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-04 12:25:13 +02:00
Thomas Gleixner 06ee6d571f genirq: Add affinity hint to irq allocation
Add an extra argument to the irq(domain) allocation functions, so we can hand
down affinity hints to the allocator. Thats necessary to implement proper
support for multiqueue devices.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: linux-block@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: axboe@fb.com
Cc: agordeev@redhat.com
Link: http://lkml.kernel.org/r/1467621574-8277-4-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-04 12:25:13 +02:00
Thomas Gleixner 9c2555835b genirq: Introduce IRQD_AFFINITY_MANAGED flag
Interupts marked with this flag are excluded from user space interrupt
affinity changes. Contrary to the IRQ_NO_BALANCING flag, the kernel internal
affinity mechanism is not blocked.

This flag will be used for multi-queue device interrupts.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: linux-block@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: axboe@fb.com
Cc: agordeev@redhat.com
Link: http://lkml.kernel.org/r/1467621574-8277-3-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-04 12:25:13 +02:00
Thomas Gleixner b6140914fd genirq/msi: Remove unused MSI_FLAG_IDENTITY_MAP
No user and we definitely don't want to grow one.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: linux-block@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: axboe@fb.com
Cc: agordeev@redhat.com
Link: http://lkml.kernel.org/r/1467621574-8277-2-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-04 12:25:12 +02:00
Rafael J. Wysocki 307c5971c9 PM / hibernate: Recycle safe pages after image restoration
One of the memory bitmaps used by the hibernation image restoration
code is freed after the image has been loaded.

That is not quite efficient, though, because the memory pages used
for building that bitmap are known to be safe (ie. they were not
used by the image kernel before hibernation) and the arch-specific
code finalizing the image restoration may need them.  In that case
it needs to allocate those pages again via the memory management
subsystem, check if they are really safe again by consulting the
other bitmaps and so on.

To avoid that, recycle those pages by putting them into the global
list of known safe pages so that they can be given to the arch code
right away when necessary.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-07-02 01:52:10 +02:00
Rafael J. Wysocki 6dbecfd345 PM / hibernate: Simplify mark_unsafe_pages()
Rework mark_unsafe_pages() to use a simpler method of clearing
all bits in free_pages_map and to set the bits for the "unsafe"
pages (ie. pages that were used by the image kernel before
hibernation) with the help of duplicate_memory_bitmap().

For this purpose, move the pfn_valid() check from mark_unsafe_pages()
to unpack_orig_pfns() where the "unsafe" pages are discovered.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-07-02 01:52:09 +02:00
Rafael J. Wysocki 9c744481c0 PM / hibernate: Do not free preallocated safe pages during image restore
The core image restoration code preallocates some safe pages
(ie. pages that weren't used by the image kernel before hibernation)
for future use before allocating the bulk of memory for loading the
image data.  Those safe pages are then freed so they can be allocated
again (with the memory management subsystem's help).  That's done to
ensure that there will be enough safe pages for temporary data
structures needed during image restoration.

However, it is not really necessary to free those pages after they
have been allocated.  They can be added to the (global) list of
safe pages right away and then picked up from there when needed
without freeing.

That reduces the overhead related to using safe pages, especially
in the arch-specific code, so modify the code accordingly.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-07-02 01:52:09 +02:00
Roger Lu 7b776af66d PM / suspend: show workqueue state in suspend flow
If freezable workqueue aborts suspend flow, show
workqueue state for debug purpose.

Signed-off-by: Roger Lu <roger.lu@mediatek.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-07-02 01:42:48 +02:00
Martin KaFai Lau 4a482f34af cgroup: bpf: Add bpf_skb_in_cgroup_proto
Adds a bpf helper, bpf_skb_in_cgroup, to decide if a skb->sk
belongs to a descendant of a cgroup2.  It is similar to the
feature added in netfilter:
commit c38c4597e4 ("netfilter: implement xt_cgroup cgroup2 path match")

The user is expected to populate a BPF_MAP_TYPE_CGROUP_ARRAY
which will be used by the bpf_skb_in_cgroup.

Modifications to the bpf verifier is to ensure BPF_MAP_TYPE_CGROUP_ARRAY
and bpf_skb_in_cgroup() are always used together.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 16:32:13 -04:00
Martin KaFai Lau 4ed8ec521e cgroup: bpf: Add BPF_MAP_TYPE_CGROUP_ARRAY
Add a BPF_MAP_TYPE_CGROUP_ARRAY and its bpf_map_ops's implementations.
To update an element, the caller is expected to obtain a cgroup2 backed
fd by open(cgroup2_dir) and then update the array with that fd.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 16:30:38 -04:00
Martin KaFai Lau 1f3fe7ebf6 cgroup: Add cgroup_get_from_fd
Add a helper function to get a cgroup2 from a fd.  It will be
stored in a bpf array (BPF_MAP_TYPE_CGROUP_ARRAY) which will
be introduced in the later patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 16:30:38 -04:00
Daniel Borkmann 113214be7f bpf: refactor bpf_prog_get and type check into helper
Since bpf_prog_get() and program type check is used in a couple of places,
refactor this into a small helper function that we can make use of. Since
the non RO prog->aux part is not used in performance critical paths and a
program destruction via RCU is rather very unlikley when doing the put, we
shouldn't have an issue just doing the bpf_prog_get() + prog->type != type
check, but actually not taking the ref at all (due to being in fdget() /
fdput() section of the bpf fd) is even cleaner and makes the diff smaller
as well, so just go for that. Callsites are changed to make use of the new
helper where possible.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 16:00:47 -04:00
Daniel Borkmann 1aacde3d22 bpf: generally move prog destruction to RCU deferral
Jann Horn reported following analysis that could potentially result
in a very hard to trigger (if not impossible) UAF race, to quote his
event timeline:

 - Set up a process with threads T1, T2 and T3
 - Let T1 set up a socket filter F1 that invokes another filter F2
   through a BPF map [tail call]
 - Let T1 trigger the socket filter via a unix domain socket write,
   don't wait for completion
 - Let T2 call PERF_EVENT_IOC_SET_BPF with F2, don't wait for completion
 - Now T2 should be behind bpf_prog_get(), but before bpf_prog_put()
 - Let T3 close the file descriptor for F2, dropping the reference
   count of F2 to 2
 - At this point, T1 should have looked up F2 from the map, but not
   finished executing it
 - Let T3 remove F2 from the BPF map, dropping the reference count of
   F2 to 1
 - Now T2 should call bpf_prog_put() (wrong BPF program type), dropping
   the reference count of F2 to 0 and scheduling bpf_prog_free_deferred()
   via schedule_work()
 - At this point, the BPF program could be freed
 - BPF execution is still running in a freed BPF program

While at PERF_EVENT_IOC_SET_BPF time it's only guaranteed that the perf
event fd we're doing the syscall on doesn't disappear from underneath us
for whole syscall time, it may not be the case for the bpf fd used as
an argument only after we did the put. It needs to be a valid fd pointing
to a BPF program at the time of the call to make the bpf_prog_get() and
while T2 gets preempted, F2 must have dropped reference to 1 on the other
CPU. The fput() from the close() in T3 should also add additionally delay
to the reference drop via exit_task_work() when bpf_prog_release() gets
called as well as scheduling bpf_prog_free_deferred().

That said, it makes nevertheless sense to move the BPF prog destruction
generally after RCU grace period to guarantee that such scenario above,
but also others as recently fixed in ceb5607035 ("bpf, perf: delay release
of BPF prog after grace period") with regards to tail calls won't happen.
Integrating bpf_prog_free_deferred() directly into the RCU callback is
not allowed since the invocation might happen from either softirq or
process context, so we're not permitted to block. Reviewing all bpf_prog_put()
invocations from eBPF side (note, cBPF -> eBPF progs don't use this for
their destruction) with call_rcu() look good to me.

Since we don't know whether at the time of attaching the program, we're
already part of a tail call map, we need to use RCU variant. However, due
to this, there won't be severely more stress on the RCU callback queue:
situations with above bpf_prog_get() and bpf_prog_put() combo in practice
normally won't lead to releases, but even if they would, enough effort/
cycles have to be put into loading a BPF program into the kernel already.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 16:00:47 -04:00
Ingo Molnar 0de7611a10 timers/nohz: Capitalize 'CPU' consistently
While reviewing another patch I noticed that kernel/time/tick-sched.c
had a charmingly (confusingly, annoyingly) rich set of variants for
spelling 'CPU':

  cpu
  cpus
  CPU
  CPUs
  per CPU
  per-CPU
  per cpu

... sometimes these were mixed even within the same comment block!

Compress these variants down to a single consistent set of:

  CPU
  CPUs
  per-CPU

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-01 12:45:34 +02:00
Wei Jiangang 6168f8ed01 timers/nohz: Fix several typos
Signed-off-by: Wei Jiangang <weijg.fnst@cn.fujitsu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: fenghua.yu@intel.com
Link: http://lkml.kernel.org/r/1467175910-2966-2-git-send-email-weijg.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-01 12:39:22 +02:00
Seth Forshee 5f65e5ca28 cred: Reject inodes with invalid ids in set_create_file_as()
Using INVALID_[UG]ID for the LSM file creation context doesn't
make sense, so return an error if the inode passed to
set_create_file_as() has an invalid id.

Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2016-06-30 18:05:09 -05:00
Gregor Boirie eaaa7ec71b timekeeping: export get_monotonic_coarse64 symbol
EXPORT_SYMBOL() get_monotonic_coarse64 for new IIO timestamping clock
selection usage. This provides user apps the ability to request a
particular IIO device to timestamp samples using a monotonic coarse clock
granularity.

Signed-off-by: Gregor Boirie <gregor.boirie@parrot.com>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
2016-06-30 19:41:23 +01:00
Daniel Borkmann 80b48c4457 bpf: don't use raw processor id in generic helper
Use smp_processor_id() for the generic helper bpf_get_smp_processor_id()
instead of the raw variant. This allows for preemption checks when we
have DEBUG_PREEMPT, and otherwise uses the raw variant anyway. We only
need to keep the raw variant for socket filters, but we can reuse the
helper that is already there from cBPF side.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-30 05:54:40 -04:00
Daniel Borkmann 6816a7ffce bpf, trace: add BPF_F_CURRENT_CPU flag for bpf_perf_event_read
Follow-up commit to 1e33759c78 ("bpf, trace: add BPF_F_CURRENT_CPU
flag for bpf_perf_event_output") to add the same functionality into
bpf_perf_event_read() helper. The split of index into flags and index
component is also safe here, since such large maps are rejected during
map allocation time.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-30 05:54:40 -04:00