This reverts commit ea1e7ed337.
Al points out that while the commit *does* actually create a separate
slab for the page->ptl allocation, that slab is never actually used, and
the code continues to use kmalloc/kfree.
Damien Wyart points out that the original patch did have the conversion
to use kmem_cache_alloc/free, so it got lost somewhere on its way to me.
Revert the half-arsed attempt that didn't do anything. If we really do
want the special slab (remember: this is all relevant just for debug
builds, so it's not necessarily all that critical) we might as well redo
the patch fully.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Kirill A Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If DEBUG_SPINLOCK and DEBUG_LOCK_ALLOC are enabled spinlock_t on x86_64
is 72 bytes. For page->ptl they will be allocated from kmalloc-96 slab,
so we loose 24 on each. An average system can easily allocate few tens
thousands of page->ptl and overhead is significant.
Let's create a separate slab for page->ptl allocation to solve this.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking updates from David Miller:
1) The addition of nftables. No longer will we need protocol aware
firewall filtering modules, it can all live in userspace.
At the core of nftables is a, for lack of a better term, virtual
machine that executes byte codes to inspect packet or metadata
(arriving interface index, etc.) and make verdict decisions.
Besides support for loading packet contents and comparing them, the
interpreter supports lookups in various datastructures as
fundamental operations. For example sets are supports, and
therefore one could create a set of whitelist IP address entries
which have ACCEPT verdicts attached to them, and use the appropriate
byte codes to do such lookups.
Since the interpreted code is composed in userspace, userspace can
do things like optimize things before giving it to the kernel.
Another major improvement is the capability of atomically updating
portions of the ruleset. In the existing netfilter implementation,
one has to update the entire rule set in order to make a change and
this is very expensive.
Userspace tools exist to create nftables rules using existing
netfilter rule sets, but both kernel implementations will need to
co-exist for quite some time as we transition from the old to the
new stuff.
Kudos to Patrick McHardy, Pablo Neira Ayuso, and others who have
worked so hard on this.
2) Daniel Borkmann and Hannes Frederic Sowa made several improvements
to our pseudo-random number generator, mostly used for things like
UDP port randomization and netfitler, amongst other things.
In particular the taus88 generater is updated to taus113, and test
cases are added.
3) Support 64-bit rates in HTB and TBF schedulers, from Eric Dumazet
and Yang Yingliang.
4) Add support for new 577xx tigon3 chips to tg3 driver, from Nithin
Sujir.
5) Fix two fatal flaws in TCP dynamic right sizing, from Eric Dumazet,
Neal Cardwell, and Yuchung Cheng.
6) Allow IP_TOS and IP_TTL to be specified in sendmsg() ancillary
control message data, much like other socket option attributes.
From Francesco Fusco.
7) Allow applications to specify a cap on the rate computed
automatically by the kernel for pacing flows, via a new
SO_MAX_PACING_RATE socket option. From Eric Dumazet.
8) Make the initial autotuned send buffer sizing in TCP more closely
reflect actual needs, from Eric Dumazet.
9) Currently early socket demux only happens for TCP sockets, but we
can do it for connected UDP sockets too. Implementation from Shawn
Bohrer.
10) Refactor inet socket demux with the goal of improving hash demux
performance for listening sockets. With the main goals being able
to use RCU lookups on even request sockets, and eliminating the
listening lock contention. From Eric Dumazet.
11) The bonding layer has many demuxes in it's fast path, and an RCU
conversion was started back in 3.11, several changes here extend the
RCU usage to even more locations. From Ding Tianhong and Wang
Yufen, based upon suggestions by Nikolay Aleksandrov and Veaceslav
Falico.
12) Allow stackability of segmentation offloads to, in particular, allow
segmentation offloading over tunnels. From Eric Dumazet.
13) Significantly improve the handling of secret keys we input into the
various hash functions in the inet hashtables, TCP fast open, as
well as syncookies. From Hannes Frederic Sowa. The key fundamental
operation is "net_get_random_once()" which uses static keys.
Hannes even extended this to ipv4/ipv6 fragmentation handling and
our generic flow dissector.
14) The generic driver layer takes care now to set the driver data to
NULL on device removal, so it's no longer necessary for drivers to
explicitly set it to NULL any more. Many drivers have been cleaned
up in this way, from Jingoo Han.
15) Add a BPF based packet scheduler classifier, from Daniel Borkmann.
16) Improve CRC32 interfaces and generic SKB checksum iterators so that
SCTP's checksumming can more cleanly be handled. Also from Daniel
Borkmann.
17) Add a new PMTU discovery mode, IP_PMTUDISC_INTERFACE, which forces
using the interface MTU value. This helps avoid PMTU attacks,
particularly on DNS servers. From Hannes Frederic Sowa.
18) Use generic XPS for transmit queue steering rather than internal
(re-)implementation in virtio-net. From Jason Wang.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
random32: add test cases for taus113 implementation
random32: upgrade taus88 generator to taus113 from errata paper
random32: move rnd_state to linux/random.h
random32: add prandom_reseed_late() and call when nonblocking pool becomes initialized
random32: add periodic reseeding
random32: fix off-by-one in seeding requirement
PHY: Add RTL8201CP phy_driver to realtek
xtsonic: add missing platform_set_drvdata() in xtsonic_probe()
macmace: add missing platform_set_drvdata() in mace_probe()
ethernet/arc/arc_emac: add missing platform_set_drvdata() in arc_emac_probe()
ipv6: protect for_each_sk_fl_rcu in mem_check with rcu_read_lock_bh
vlan: Implement vlan_dev_get_egress_qos_mask as an inline.
ixgbe: add warning when max_vfs is out of range.
igb: Update link modes display in ethtool
netfilter: push reasm skb through instead of original frag skbs
ip6_output: fragment outgoing reassembled skb properly
MAINTAINERS: mv643xx_eth: take over maintainership from Lennart
net_sched: tbf: support of 64bit rates
ixgbe: deleting dfwd stations out of order can cause null ptr deref
ixgbe: fix build err, num_rx_queues is only available with CONFIG_RPS
...
This patch proposes to make init failures more explicit.
Before this, the "No init found" message didn't help much. It could
sometimes be misleading and actually mean "No *working* init found".
This message could hide many different issues:
- no init program candidates found at all
- some init program candidates exist but can't be executed (missing
execute permissions, failed to load shared libraries, executable
compiled for an unknown architecture...)
This patch notifies the kernel user when a candidate init program is found
but can't be executed. In each failure situation, the error code is
displayed, to quickly find the root cause. "No init found" is also
replaced by "No working init found", which is more correct.
This will help embedded Linux developers (especially the newcomers),
regularly making and debugging new root filesystems.
Credits to Geert Uytterhoeven and Janne Karhunen for their improvement
suggestions.
Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Janne Karhunen <Janne.Karhunen@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Before commit 026cee0086
("params: <level>_initcall-like kernel parameters") the __setup
parameter parsing code could modify parameter in the
static_command_line buffer and such modifications were kept. After
that commit such modifications are destroyed during per-initcall level
parameter parsing because the same static_command_line buffer is used
and only parameters for appropriate initcall level are parsed.
That change broke at least parsing "ubd" parameter in the ubd driver
when the COW file is used.
Now the separate buffer is used for per-initcall parameter parsing.
Signed-off-by: Krzysztof Mazur <krzysiek@podlesie.net>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Conflicts:
drivers/net/usb/qmi_wwan.c
include/net/dst.h
Trivial merge conflicts, both were overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Usage of the static key primitives to toggle a branch must not be used
before jump_label_init() is called from init/main.c. jump_label_init
reorganizes and wires up the jump_entries so usage before that could
have unforeseen consequences.
Following primitives are now checked for correct use:
* static_key_slow_inc
* static_key_slow_dec
* static_key_slow_dec_deferred
* jump_label_rate_limit
The x86 architecture already checks this by testing if the default_nop
was already replaced with an optimal nop or with a branch instruction. It
will panic then. Other architectures don't check for this.
Because we need to relax this check for the x86 arch to allow code to
transition from default_nop to the enabled state and other architectures
did not check for this at all this patch introduces checking on the
static_key primitives in a non-arch dependent manner.
All checked functions are considered slow-path so the additional check
does no harm to performance.
The warnings are best observed with earlyprintk.
Based on a patch from Andi Kleen.
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
non-x86 platforms, in particular MIPS and ARM.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQIcBAABCAAGBQJSVvO/AAoJENNvdpvBGATwNZ4P+wadRWY/Gdz/p9332qdVrGYs
nP4DPWSg+n3RH/fOnacEwHF5vqapTe03G82NriCaVGFP8O9j7bo6ByMKKkIR7yvr
4sHUX4YMc/DwchaIHH2xp8fQoMc3Mv7mn8bodTtPXgveeldEvtuUQM0q+j4DXZUT
qSLMGElgJYrpIf2Cm8JAIBkt2QuzpZPPX7Z6glZunpvfLSMmgn3Vj2ilNEx1YCFH
v+Rk1ZYLjg2LzUYqaO7HOXlRJqmE10I7ZmNvPXJZ9fuPmGYD9FU6WeHhmIAFYdFw
V6bAzou+LbnuNVoW6yiDvrKcOXgh2Spbk6SaKVSrcjVPfc87ocNzGWI4OTfNy1xI
Kv9u4YfU3pIUWPDGx0mvT/KXAXl/PGVfu7bYXDcN2I2tqlrbBPdIWqpFB2eTn7/j
//XbatoT6gGZTuseCKhYXWpG8AE5pCfbjGnd9il21fvlUDdkIq42wAs96qjc6Ruj
tPCi5yYzLiHsn4eau+SJqI1KxPLf6YJw9Qo+f70FGl63wXJU9Vr07ID2rGTwXm1m
Qf1joTtx900PvfzUaD0ODbQZaTbX6ebSOkriKpKWYwg+26Gdc7JAxIVI3HDOlOR+
++r1M4ERwDic/xdVsB6Mngmop3d1BeNU2IAoiRDZwcJpS1+MLivlIbd1PjBAt0bU
+oOm+wseHEzSnlgucQ0g
=qnTe
-----END PGP SIGNATURE-----
Merge tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random
Pull /dev/random changes from Ted Ts'o:
"These patches are designed to enable improvements to /dev/random for
non-x86 platforms, in particular MIPS and ARM"
* tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random:
random: allow architectures to optionally define random_get_entropy()
random: run random_int_secret_init() run after all late_initcalls
Replace the single preempt_count() 'function' that's an lvalue with
two proper functions:
preempt_count() - returns the preempt_count value as rvalue
preempt_count_set() - Allows setting the preempt-count value
Also provide preempt_count_ptr() as a convenience wrapper to implement
all modifying operations.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-orxrbycjozopqfhb4dxdkdvb@git.kernel.org
[ Fixed build failure. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The some platforms (e.g., ARM) initializes their clocks as
late_initcalls for some unknown reason. So make sure
random_int_secret_init() is run after all of the late_initcalls are
run.
Cc: stable@vger.kernel.org
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Prepare for using a static key in the context tracking subsystem.
This will help optimizing the off case on its many users:
* user_enter, user_exit, exception_enter, exception_exit, guest_enter,
guest_exit, vtime_*()
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Kevin Hilman <khilman@linaro.org>
Pull timer core updates from Thomas Gleixner:
"The timer changes contain:
- posix timer code consolidation and fixes for odd corner cases
- sched_clock implementation moved from ARM to core code to avoid
duplication by other architectures
- alarm timer updates
- clocksource and clockevents unregistration facilities
- clocksource/events support for new hardware
- precise nanoseconds RTC readout (Xen feature)
- generic support for Xen suspend/resume oddities
- the usual lot of fixes and cleanups all over the place
The parts which touch other areas (ARM/XEN) have been coordinated with
the relevant maintainers. Though this results in an handful of
trivial to solve merge conflicts, which we preferred over nasty cross
tree merge dependencies.
The patches which have been committed in the last few days are bug
fixes plus the posix timer lot. The latter was in akpms queue and
next for quite some time; they just got forgotten and Frederic
collected them last minute."
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (59 commits)
hrtimer: Remove unused variable
hrtimers: Move SMP function call to thread context
clocksource: Reselect clocksource when watchdog validated high-res capability
posix-cpu-timers: don't account cpu timer after stopped thread runtime accounting
posix_timers: fix racy timer delta caching on task exit
posix-timers: correctly get dying task time sample in posix_cpu_timer_schedule()
selftests: add basic posix timers selftests
posix_cpu_timers: consolidate expired timers check
posix_cpu_timers: consolidate timer list cleanups
posix_cpu_timer: consolidate expiry time type
tick: Sanitize broadcast control logic
tick: Prevent uncontrolled switch to oneshot mode
tick: Make oneshot broadcast robust vs. CPU offlining
x86: xen: Sync the CMOS RTC as well as the Xen wallclock
x86: xen: Sync the wallclock when the system time is set
timekeeping: Indicate that clock was set in the pvclock gtod notifier
timekeeping: Pass flags instead of multiple bools to timekeeping_update()
xen: Remove clock_was_set() call in the resume path
hrtimers: Support resuming with two or more CPUs online (but stopped)
timer: Fix jiffies wrap behavior of round_jiffies_common()
...
do_one_initcall() uses a 64 byte string buffer to save a message. This
buffer is declared static and is only used at boot up and when a module
is loaded. As 64 bytes is very small, and this function has very limited
scope, there's no reason to waste permanent memory with this string and
not just simply put it on the stack.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nothing about the sched_clock implementation in the ARM port is
specific to the architecture. Generalize the code so that other
architectures can use it by selecting GENERIC_SCHED_CLOCK.
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
[jstultz: Merge minor collisions with other patches in my tree]
Signed-off-by: John Stultz <john.stultz@linaro.org>
The current scheme of using the timer tick was fine for per-thread
events. However, it was causing bias issues in system-wide mode
(including for uncore PMUs). Event groups would not get their fair
share of runtime on the PMU. With tickless kernels, if a core is idle
there is no timer tick, and thus no event rotation (multiplexing).
However, there are events (especially uncore events) which do count
even though cores are asleep.
This patch changes the timer source for multiplexing. It introduces a
per-PMU per-cpu hrtimer. The advantage is that even when a core goes
idle, it will come back to service the hrtimer, thus multiplexing on
system-wide events works much better.
The per-PMU implementation (suggested by PeterZ) enables adjusting the
multiplexing interval per PMU. The preferred interval is stashed into
the struct pmu. If not set, it will be forced to the default interval
value.
In order to minimize the impact of the hrtimer, it is turned on and
off on demand. When the PMU on a CPU is overcommited, the hrtimer is
activated. It is stopped when the PMU is not overcommitted.
In order for this to work properly, we had to change the order of
initialization in start_kernel() such that hrtimer_init() is run
before perf_event_init().
The default interval in milliseconds is set to a timer tick just like
with the old code. We will provide a sysctl to tune this in another
patch.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: http://lkml.kernel.org/r/1364991694-5876-2-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull 'full dynticks' support from Ingo Molnar:
"This tree from Frederic Weisbecker adds a new, (exciting! :-) core
kernel feature to the timer and scheduler subsystems: 'full dynticks',
or CONFIG_NO_HZ_FULL=y.
This feature extends the nohz variable-size timer tick feature from
idle to busy CPUs (running at most one task) as well, potentially
reducing the number of timer interrupts significantly.
This feature got motivated by real-time folks and the -rt tree, but
the general utility and motivation of full-dynticks runs wider than
that:
- HPC workloads get faster: CPUs running a single task should be able
to utilize a maximum amount of CPU power. A periodic timer tick at
HZ=1000 can cause a constant overhead of up to 1.0%. This feature
removes that overhead - and speeds up the system by 0.5%-1.0% on
typical distro configs even on modern systems.
- Real-time workload latency reduction: CPUs running critical tasks
should experience as little jitter as possible. The last remaining
source of kernel-related jitter was the periodic timer tick.
- A single task executing on a CPU is a pretty common situation,
especially with an increasing number of cores/CPUs, so this feature
helps desktop and mobile workloads as well.
The cost of the feature is mainly related to increased timer
reprogramming overhead when a CPU switches its tick period, and thus
slightly longer to-idle and from-idle latency.
Configuration-wise a third mode of operation is added to the existing
two NOHZ kconfig modes:
- CONFIG_HZ_PERIODIC: [formerly !CONFIG_NO_HZ], now explicitly named
as a config option. This is the traditional Linux periodic tick
design: there's a HZ tick going on all the time, regardless of
whether a CPU is idle or not.
- CONFIG_NO_HZ_IDLE: [formerly CONFIG_NO_HZ=y], this turns off the
periodic tick when a CPU enters idle mode.
- CONFIG_NO_HZ_FULL: this new mode, in addition to turning off the
tick when a CPU is idle, also slows the tick down to 1 Hz (one
timer interrupt per second) when only a single task is running on a
CPU.
The .config behavior is compatible: existing !CONFIG_NO_HZ and
CONFIG_NO_HZ=y settings get translated to the new values, without the
user having to configure anything. CONFIG_NO_HZ_FULL is turned off by
default.
This feature is based on a lot of infrastructure work that has been
steadily going upstream in the last 2-3 cycles: related RCU support
and non-periodic cputime support in particular is upstream already.
This tree adds the final pieces and activates the feature. The pull
request is marked RFC because:
- it's marked 64-bit only at the moment - the 32-bit support patch is
small but did not get ready in time.
- it has a number of fresh commits that came in after the merge
window. The overwhelming majority of commits are from before the
merge window, but still some aspects of the tree are fresh and so I
marked it RFC.
- it's a pretty wide-reaching feature with lots of effects - and
while the components have been in testing for some time, the full
combination is still not very widely used. That it's default-off
should reduce its regression abilities and obviously there are no
known regressions with CONFIG_NO_HZ_FULL=y enabled either.
- the feature is not completely idempotent: there is no 100%
equivalent replacement for a periodic scheduler/timer tick. In
particular there's ongoing work to map out and reduce its effects
on scheduler load-balancing and statistics. This should not impact
correctness though, there are no known regressions related to this
feature at this point.
- it's a pretty ambitious feature that with time will likely be
enabled by most Linux distros, and we'd like you to make input on
its design/implementation, if you dislike some aspect we missed.
Without flaming us to crisp! :-)
Future plans:
- there's ongoing work to reduce 1Hz to 0Hz, to essentially shut off
the periodic tick altogether when there's a single busy task on a
CPU. We'd first like 1 Hz to be exposed more widely before we go
for the 0 Hz target though.
- once we reach 0 Hz we can remove the periodic tick assumption from
nr_running>=2 as well, by essentially interrupting busy tasks only
as frequently as the sched_latency constraints require us to do -
once every 4-40 msecs, depending on nr_running.
I am personally leaning towards biting the bullet and doing this in
v3.10, like the -rt tree this effort has been going on for too long -
but the final word is up to you as usual.
More technical details can be found in Documentation/timers/NO_HZ.txt"
* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
sched: Keep at least 1 tick per second for active dynticks tasks
rcu: Fix full dynticks' dependency on wide RCU nocb mode
nohz: Protect smp_processor_id() in tick_nohz_task_switch()
nohz_full: Add documentation.
cputime_nsecs: use math64.h for nsec resolution conversion helpers
nohz: Select VIRT_CPU_ACCOUNTING_GEN from full dynticks config
nohz: Reduce overhead under high-freq idling patterns
nohz: Remove full dynticks' superfluous dependency on RCU tree
nohz: Fix unavailable tick_stop tracepoint in dynticks idle
nohz: Add basic tracing
nohz: Select wide RCU nocb for full dynticks
nohz: Disable the tick when irq resume in full dynticks CPU
nohz: Re-evaluate the tick for the new task after a context switch
nohz: Prepare to stop the tick on irq exit
nohz: Implement full dynticks kick
nohz: Re-evaluate the tick from the scheduler IPI
sched: New helper to prevent from stopping the tick in full dynticks
sched: Kick full dynticks CPU that have more than one task enqueued.
perf: New helper to prevent full dynticks CPUs from stopping tick
perf: Kick full dynticks CPU if events rotation is needed
...
The full dynticks tree needs the latest RCU and sched
upstream updates in order to fix some dependencies.
Merge a common upstream merge point that has these
updates.
Conflicts:
include/linux/perf_event.h
kernel/rcutree.h
kernel/rcutree_plugin.h
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Commit f91eb62f71 ("init: scream bloody murder if interrupts are
enabled too early") added three new warnings. The first two seemed
reasonable, but the third included a warning when an initcall returned
non-zero. Although, the third WARN() does include an imbalanced preempt
disabled, or irqs disable, it shouldn't warn if it only had an initcall
that just returns non-zero.
In fact, according to Linus, it shouldn't print at all. As it only
prints with initcall_debug set, and that already shows enough
information to fix things.
Link: http://lkml.kernel.org/r/CA+55aFzaBC5SFi7=F2mfm+KWY5qTsBmOqgbbs8E+LUS8JK-sBg@mail.gmail.com
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull core timer updates from Ingo Molnar:
"The main changes in this cycle's merge are:
- Implement shadow timekeeper to shorten in kernel reader side
blocking, by Thomas Gleixner.
- Posix timers enhancements by Pavel Emelyanov:
- allocate timer ID per process, so that exact timer ID allocations
can be re-created be checkpoint/restore code.
- debuggability and tooling (/proc/PID/timers, etc.) improvements.
- suspend/resume enhancements by Feng Tang: on certain new Intel Atom
processors (Penwell and Cloverview), there is a feature that the
TSC won't stop in S3 state, so the TSC value won't be reset to 0
after resume. This can be taken advantage of by the generic via
the CLOCK_SOURCE_SUSPEND_NONSTOP flag: instead of using the RTC to
recover/approximate sleep time, the main (and precise) clocksource
can be used.
- Fix /proc/timer_list for 4096 CPUs by Nathan Zimmer: on so many
CPUs the file goes beyond 4MB of size and thus the current
simplistic seqfile approach fails. Convert /proc/timer_list to a
proper seq_file with its own iterator.
- Cleanups and refactorings of the core timekeeping code by John
Stultz.
- International Atomic Clock time is managed by the NTP code
internally currently but not exposed externally. Separate the TAI
code out and add CLOCK_TAI support and TAI support to the hrtimer
and posix-timer code, by John Stultz.
- Add deep idle support enhacement to the broadcast clockevents core
timer code, by Daniel Lezcano: add an opt-in CLOCK_EVT_FEAT_DYNIRQ
clockevents feature (which will be utilized by future clockevents
driver updates), which allows the use of IRQ affinities to avoid
spurious wakeups of idle CPUs - the right CPU with an expiring
timer will be woken.
- Add new ARM bcm281xx clocksource driver, by Christian Daudt
- ... various other fixes and cleanups"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
clockevents: Set dummy handler on CPU_DEAD shutdown
timekeeping: Update tk->cycle_last in resume
posix-timers: Remove unused variable
clockevents: Switch into oneshot mode even if broadcast registered late
timer_list: Convert timer list to be a proper seq_file
timer_list: Split timer_list_show_tickdevices
posix-timers: Show sigevent info in proc file
posix-timers: Introduce /proc/PID/timers file
posix timers: Allocate timer id per process (v2)
timekeeping: Make sure to notify hrtimers when TAI offset changes
hrtimer: Fix ktime_add_ns() overflow on 32bit architectures
hrtimer: Add expiry time overflow check in hrtimer_interrupt
timekeeping: Shorten seq_count region
timekeeping: Implement a shadow timekeeper
timekeeping: Delay update of clock->cycle_last
timekeeping: Store cycle_last value in timekeeper struct as well
ntp: Remove ntp_lock, using the timekeeping locks to protect ntp state
timekeeping: Simplify tai updating from do_adjtimex
timekeeping: Hold timekeepering locks in do_adjtimex and hardpps
timekeeping: Move ADJ_SETOFFSET to top level do_adjtimex()
...
Pull SMP/hotplug changes from Ingo Molnar:
"This is a pretty large, multi-arch series unifying and generalizing
the various disjunct pieces of idle routines that architectures have
historically copied from each other and have grown in random, wildly
inconsistent and sometimes buggy directions:
101 files changed, 455 insertions(+), 1328 deletions(-)
this went through a number of review and test iterations before it was
committed, it was tested on various architectures, was exposed to
linux-next for quite some time - nevertheless it might cause problems
on architectures that don't read the mailing lists and don't regularly
test linux-next.
This cat herding excercise was motivated by the -rt kernel, and was
brought to you by Thomas "the Whip" Gleixner."
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
idle: Remove GENERIC_IDLE_LOOP config switch
um: Use generic idle loop
ia64: Make sure interrupts enabled when we "safe_halt()"
sparc: Use generic idle loop
idle: Remove unused ARCH_HAS_DEFAULT_IDLE
bfin: Fix typo in arch_cpu_idle()
xtensa: Use generic idle loop
x86: Use generic idle loop
unicore: Use generic idle loop
tile: Use generic idle loop
tile: Enter idle with preemption disabled
sh: Use generic idle loop
score: Use generic idle loop
s390: Use generic idle loop
powerpc: Use generic idle loop
parisc: Use generic idle loop
openrisc: Use generic idle loop
mn10300: Use generic idle loop
mips: Use generic idle loop
microblaze: Use generic idle loop
...
Also enables cleanup of some 80-col trickery.
Cc: Richard Weinberger <richard@nod.at>
Cc: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the kernel was booted with the "quiet" boot option we have currently no
chance to see why an initrd fails. Change KERN_WARNING to KERN_ERR to see
what is going on.
Signed-off-by: Richard Weinberger <richard@nod.at>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jim Cromie <jim.cromie@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As I was testing a lot of my code recently, and having several
"successes", I accidentally noticed in the dmesg this little line:
start_kernel(): bug: interrupts were enabled *very* early, fixing it
Sure enough, one of my patches two commits ago enabled interrupts early.
The sad part here is that I never noticed it, and I ran several tests with
ktest too, and ktest did not notice this line.
What ktest looks for (and so does many other automated testing scripts) is
a back trace produced by a WARN_ON() or BUG(). As a back trace was never
produced, my buggy patch could have slipped into linux-next, or even
worse, mainline.
Adding a WARN(!irqs_disabled()) makes this bug a little more obvious:
PID hash table entries: 4096 (order: 3, 32768 bytes)
__ex_table already sorted, skipping sort
Checking aperture...
No AGP bridge found
Calgary: detecting Calgary via BIOS EBDA area
Calgary: Unable to locate Rio Grande table in EBDA - bailing!
Memory: 2003252k/2054848k available (4857k kernel code, 460k absent, 51136k reserved, 6210k data, 1096k init)
------------[ cut here ]------------
WARNING: at /home/rostedt/work/git/linux-trace.git/init/main.c:543 start_kernel+0x21e/0x415()
Hardware name: To Be Filled By O.E.M.
Interrupts were enabled *very* early, fixing it
Modules linked in:
Pid: 0, comm: swapper/0 Not tainted 3.8.0-test+ #286
Call Trace:
warn_slowpath_common+0x83/0x9b
warn_slowpath_fmt+0x46/0x48
start_kernel+0x21e/0x415
x86_64_start_reservations+0x10e/0x112
x86_64_start_kernel+0x102/0x111
---[ end trace 007d8b0491b4f5d8 ]---
Preemptible hierarchical RCU implementation.
RCU restricting CPUs from NR_CPUS=8 to nr_cpu_ids=4.
NR_IRQS:4352 nr_irqs:712 16
Console: colour VGA+ 80x25
console [ttyS0] enabled, bootconsole disabled
Do you see it?
The original version of this patch just slapped a WARN_ON() in there and
kept the printk(). Ard van Breemen suggested using the WARN() interface,
which makes the code a bit cleaner.
Also, while examining other warnings in init/main.c, I found two other
locations that deserve a bloody murder scream if their conditions are hit,
and updated them accordingly.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Ard van Breemen <ard@telegraafnet.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need full dynticks CPU to also be RCU nocb so
that we don't have to keep the tick to handle RCU
callbacks.
Make sure the range passed to nohz_full= boot
parameter is a subset of rcu_nocbs=
The CPUs that fail to meet this requirement will be
excluded from the nohz_full range. This is checked
early in boot time, before any CPU has the opportunity
to stop its tick.
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
For now this calls cpu_idle(), but in the long run we want to move the
cpu bringup code to the core and therefor we add a state argument.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20130321215233.583190032@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
To convert the clockevents code to cpumask_var_t we need to move the
init call after the allocator setup.
Clockevents are earliest registered from time_init() as they need
interrupts being set up, so this is safe.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20130306111537.304379448@linutronix.de
Cc: Rusty Russell <rusty@rustcorp.com.au>
Pull async changes from Tejun Heo:
"These are followups for the earlier deadlock issue involving async
ending up waiting for itself through block requesting module[1]. The
following changes are made by these commits.
- Instead of requesting default elevator on each request_queue init,
block now requests it once early during boot.
- Kmod triggers warning if invoked from an async worker.
- Async synchronization implementation has been reimplemented. It's
a lot simpler now."
* 'for-3.9-async' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
async: initialise list heads to fix crash
async: replace list of active domains with global list of pending items
async: keep pending tasks on async_domain and remove async_pending
async: use ULLONG_MAX for infinity cookie value
async: bring sanity to the use of words domain and running
async, kmod: warn on synchronous request_module() from async workers
block: don't request module during elevator init
init, block: try to load default elevator module early during boot
Pull x86 fixes from Peter Anvin:
"This is a collection of miscellaneous fixes, the most important one is
the fix for the Samsung laptop bricking issue (auto-blacklisting the
samsung-laptop driver); the efi_enabled() changes you see below are
prerequisites for that fix.
The other issues fixed are booting on OLPC XO-1.5, an UV fix, NMI
debugging, and requiring CAP_SYS_RAWIO for MSR references, just as
with I/O port references."
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
samsung-laptop: Disable on EFI hardware
efi: Make 'efi_enabled' a function to query EFI facilities
smp: Fix SMP function call empty cpu mask race
x86/msr: Add capabilities check
x86/dma-debug: Bump PREALLOC_DMA_DEBUG_ENTRIES
x86/olpc: Fix olpc-xo1-sci.c build errors
arch/x86/platform/uv: Fix incorrect tlb flush all issue
x86-64: Fix unwind annotations in recent NMI changes
x86-32: Start out cr0 clean, disable paging before modifying cr3/4
Originally 'efi_enabled' indicated whether a kernel was booted from
EFI firmware. Over time its semantics have changed, and it now
indicates whether or not we are booted on an EFI machine with
bit-native firmware, e.g. 64-bit kernel with 64-bit firmware.
The immediate motivation for this patch is the bug report at,
https://bugs.launchpad.net/ubuntu-cdimage/+bug/1040557
which details how running a platform driver on an EFI machine that is
designed to run under BIOS can cause the machine to become
bricked. Also, the following report,
https://bugzilla.kernel.org/show_bug.cgi?id=47121
details how running said driver can also cause Machine Check
Exceptions. Drivers need a new means of detecting whether they're
running on an EFI machine, as sadly the expression,
if (!efi_enabled)
hasn't been a sufficient condition for quite some time.
Users actually want to query 'efi_enabled' for different reasons -
what they really want access to is the list of available EFI
facilities.
For instance, the x86 reboot code needs to know whether it can invoke
the ResetSystem() function provided by the EFI runtime services, while
the ACPI OSL code wants to know whether the EFI config tables were
mapped successfully. There are also checks in some of the platform
driver code to simply see if they're running on an EFI machine (which
would make it a bad idea to do BIOS-y things).
This patch is a prereq for the samsung-laptop fix patch.
Cc: David Airlie <airlied@linux.ie>
Cc: Corentin Chary <corentincj@iksaif.net>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Olof Johansson <olof@lixom.net>
Cc: Peter Jones <pjones@redhat.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Steve Langasek <steve.langasek@canonical.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Konrad Rzeszutek Wilk <konrad@kernel.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: <stable@vger.kernel.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This patch adds default module loading and uses it to load the default
block elevator. During boot, it's called right after initramfs or
initrd is made available and right before control is passed to
userland. This ensures that as long as the modules are available in
the usual places in initramfs, initrd or the root filesystem, the
default modules are loaded as soon as possible.
This will replace the on-demand elevator module loading from elevator
init path.
v2: Fixed build breakage when !CONFIG_BLOCK. Reported by kbuild test
robot.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alex Riesen <raa.lkml@gmail.com>
Cc: Fengguang We <fengguang.wu@intel.com>
Commit d6b2123802 "make sure that we always have a return path from
kernel_execve()" reshuffled kernel_init()/init_post() to ensure that
kernel_execve() has a caller to return to.
It removed __init annotation for kernel_init() and introduced/calls a
new routine kernel_init_freeable(). Latter however is inlined by any
reasonable compiler (ARC gcc 4.4 in this case), causing slight code
bloat.
This patch forces kernel_init_freeable() as noinline reducing the .text
bloat-o-meter vmlinux vmlinux_new
add/remove: 1/0 grow/shrink: 0/1 up/down: 374/-334 (40)
function old new delta
kernel_init_freeable - 374 +374 (.init.text)
kernel_init 628 294 -334 (.text)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull signal handling cleanups from Al Viro:
"sigaltstack infrastructure + conversion for x86, alpha and um,
COMPAT_SYSCALL_DEFINE infrastructure.
Note that there are several conflicts between "unify
SS_ONSTACK/SS_DISABLE definitions" and UAPI patches in mainline;
resolution is trivial - just remove definitions of SS_ONSTACK and
SS_DISABLED from arch/*/uapi/asm/signal.h; they are all identical and
include/uapi/linux/signal.h contains the unified variant."
Fixed up conflicts as per Al.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
alpha: switch to generic sigaltstack
new helpers: __save_altstack/__compat_save_altstack, switch x86 and um to those
generic compat_sys_sigaltstack()
introduce generic sys_sigaltstack(), switch x86 and um to it
new helper: compat_user_stack_pointer()
new helper: restore_altstack()
unify SS_ONSTACK/SS_DISABLE definitions
new helper: current_user_stack_pointer()
missing user_stack_pointer() instances
Bury the conditionals from kernel_thread/kernel_execve series
COMPAT_SYSCALL_DEFINE: infrastructure
All architectures have
CONFIG_GENERIC_KERNEL_THREAD
CONFIG_GENERIC_KERNEL_EXECVE
__ARCH_WANT_SYS_EXECVE
None of them have __ARCH_WANT_KERNEL_EXECVE and there are only two callers
of kernel_execve() (which is a trivial wrapper for do_execve() now) left.
Kill the conditionals and make both callers use do_execve().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull user namespace changes from Eric Biederman:
"While small this set of changes is very significant with respect to
containers in general and user namespaces in particular. The user
space interface is now complete.
This set of changes adds support for unprivileged users to create user
namespaces and as a user namespace root to create other namespaces.
The tyranny of supporting suid root preventing unprivileged users from
using cool new kernel features is broken.
This set of changes completes the work on setns, adding support for
the pid, user, mount namespaces.
This set of changes includes a bunch of basic pid namespace
cleanups/simplifications. Of particular significance is the rework of
the pid namespace cleanup so it no longer requires sending out
tendrils into all kinds of unexpected cleanup paths for operation. At
least one case of broken error handling is fixed by this cleanup.
The files under /proc/<pid>/ns/ have been converted from regular files
to magic symlinks which prevents incorrect caching by the VFS,
ensuring the files always refer to the namespace the process is
currently using and ensuring that the ptrace_mayaccess permission
checks are always applied.
The files under /proc/<pid>/ns/ have been given stable inode numbers
so it is now possible to see if different processes share the same
namespaces.
Through the David Miller's net tree are changes to relax many of the
permission checks in the networking stack to allowing the user
namespace root to usefully use the networking stack. Similar changes
for the mount namespace and the pid namespace are coming through my
tree.
Two small changes to add user namespace support were commited here adn
in David Miller's -net tree so that I could complete the work on the
/proc/<pid>/ns/ files in this tree.
Work remains to make it safe to build user namespaces and 9p, afs,
ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the
Kconfig guard remains in place preventing that user namespaces from
being built when any of those filesystems are enabled.
Future design work remains to allow root users outside of the initial
user namespace to mount more than just /proc and /sys."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (38 commits)
proc: Usable inode numbers for the namespace file descriptors.
proc: Fix the namespace inode permission checks.
proc: Generalize proc inode allocation
userns: Allow unprivilged mounts of proc and sysfs
userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file
procfs: Print task uids and gids in the userns that opened the proc file
userns: Implement unshare of the user namespace
userns: Implent proc namespace operations
userns: Kill task_user_ns
userns: Make create_new_namespaces take a user_ns parameter
userns: Allow unprivileged use of setns.
userns: Allow unprivileged users to create new namespaces
userns: Allow setting a userns mapping to your current uid.
userns: Allow chown and setgid preservation
userns: Allow unprivileged users to create user namespaces.
userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped
userns: fix return value on mntns_install() failure
vfs: Allow unprivileged manipulation of the mount namespace.
vfs: Only support slave subtrees across different user namespaces
vfs: Add a user namespace reference from struct mnt_namespace
...
This reverts commit bd52276fa1 ("x86-64/efi: Use EFI to deal with
platform wall clock (again)"), and the two supporting commits:
da5a108d05b4: "x86/kernel: remove tboot 1:1 page table creation code"
185034e72d59: "x86, efi: 1:1 pagetable mapping for virtual EFI calls")
as they all depend semantically on commit 53b87cf088 ("x86, mm:
Include the entire kernel memory map in trampoline_pgd") that got
reverted earlier due to the problems it caused.
This was pointed out by Yinghai Lu, and verified by me on my Macbook Air
that uses EFI.
Pointed-out-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull x86 EFI update from Peter Anvin:
"EFI tree, from Matt Fleming. Most of the patches are the new efivarfs
filesystem by Matt Garrett & co. The balance are support for EFI
wallclock in the absence of a hardware-specific driver, and various
fixes and cleanups."
* 'core-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
efivarfs: Make efivarfs_fill_super() static
x86, efi: Check table header length in efi_bgrt_init()
efivarfs: Use query_variable_info() to limit kmalloc()
efivarfs: Fix return value of efivarfs_file_write()
efivarfs: Return a consistent error when efivarfs_get_inode() fails
efivarfs: Make 'datasize' unsigned long
efivarfs: Add unique magic number
efivarfs: Replace magic number with sizeof(attributes)
efivarfs: Return an error if we fail to read a variable
efi: Clarify GUID length calculations
efivarfs: Implement exclusive access for {get,set}_variable
efivarfs: efivarfs_fill_super() ensure we clean up correctly on error
efivarfs: efivarfs_fill_super() ensure we free our temporary name
efivarfs: efivarfs_fill_super() fix inode reference counts
efivarfs: efivarfs_create() ensure we drop our reference on inode on error
efivarfs: efivarfs_file_read ensure we free data in error paths
x86-64/efi: Use EFI to deal with platform wall clock (again)
x86/kernel: remove tboot 1:1 page table creation code
x86, efi: 1:1 pagetable mapping for virtual EFI calls
x86, mm: Include the entire kernel memory map in trampoline_pgd
...
N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.
The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of setting child_reaper and SIGNAL_UNKILLABLE one way
for the system init process, and another way for pid namespace
init processes test pid->nr == 1 and use the same code for both.
For the global init this results in SIGNAL_UNKILLABLE being set
much earlier in the initialization process.
This is a small cleanup and it paves the way for allowing unshare and
enter of the pid namespace as that path like our global init also will
not set CLONE_NEWPID.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
gcc-4.1.2 inlines weak functions, which causes FRV to fail when the dummy
thread_info_cache_init() gets inlined into start_kernel().
Signed-off-by: David Howells <dhowells@redhat.com>
Other than ix86, x86-64 on EFI so far didn't set the
{g,s}et_wallclock accessors to the EFI routines, thus
incorrectly using raw RTC accesses instead.
Simply removing the #ifdef around the respective code isn't
enough, however: While so far early get-time calls were done in
physical mode, this doesn't work properly for x86-64, as virtual
addresses would still need to be set up for all runtime regions
(which wasn't the case on the system I have access to), so
instead the patch moves the call to efi_enter_virtual_mode()
ahead (which in turn allows to drop all code related to calling
efi-get-time in physical mode).
Additionally the earlier calling of efi_set_executable()
requires the CPA code to cope, i.e. during early boot it must be
avoided to call cpa_flush_array(), as the first thing this
function does is a BUG_ON(irqs_disabled()).
Also make the two EFI functions in question here static -
they're not being referenced elsewhere.
History:
This commit was originally merged as bacef661ac ("x86-64/efi:
Use EFI to deal with platform wall clock") but it resulted in some
ASUS machines no longer booting due to a firmware bug, and so was
reverted in f026cfa82f. A pre-emptive fix for the buggy ASUS
firmware was merged in 03a1c254975e ("x86, efi: 1:1 pagetable
mapping for virtual EFI calls") so now this patch can be
reapplied.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Matt Fleming <matt.fleming@intel.com>
Acked-by: Matthew Garrett <mjg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com> [added commit history]
Pull third pile of kernel_execve() patches from Al Viro:
"The last bits of infrastructure for kernel_thread() et.al., with
alpha/arm/x86 use of those. Plus sanitizing the asm glue and
do_notify_resume() on alpha, fixing the "disabled irq while running
task_work stuff" breakage there.
At that point the rest of kernel_thread/kernel_execve/sys_execve work
can be done independently for different architectures. The only
pending bits that do depend on having all architectures converted are
restrictred to fs/* and kernel/* - that'll obviously have to wait for
the next cycle.
I thought we'd have to wait for all of them done before we start
eliminating the longjump-style insanity in kernel_execve(), but it
turned out there's a very simple way to do that without flagday-style
changes."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
alpha: switch to saner kernel_execve() semantics
arm: switch to saner kernel_execve() semantics
x86, um: convert to saner kernel_execve() semantics
infrastructure for saner ret_from_kernel_thread semantics
make sure that kernel_thread() callbacks call do_exit() themselves
make sure that we always have a return path from kernel_execve()
ppc: eeh_event should just use kthread_run()
don't bother with kernel_thread/kernel_execve for launching linuxrc
alpha: get rid of switch_stack argument of do_work_pending()
alpha: don't bother passing switch_stack separately from regs
alpha: take SIGPENDING/NOTIFY_RESUME loop into signal.c
alpha: simplify TIF_NEED_RESCHED handling
* allow kernel_execve() leave the actual return to userland to
caller (selected by CONFIG_GENERIC_KERNEL_EXECVE). Callers
updated accordingly.
* architecture that does select GENERIC_KERNEL_EXECVE in its
Kconfig should have its ret_from_kernel_thread() do this:
call schedule_tail
call the callback left for it by copy_thread(); if it ever
returns, that's because it has just done successful kernel_execve()
jump to return from syscall
IOW, its only difference from ret_from_fork() is that it does call the
callback.
* such an architecture should also get rid of ret_from_kernel_execve()
and __ARCH_WANT_KERNEL_EXECVE
This is the last part of infrastructure patches in that area - from
that point on work on different architectures can live independently.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The only place where kernel_execve() is called without a way to
return to the caller of kernel_thread() callback is kernel_post().
Reorganize kernel_init()/kernel_post() - instead of the former
calling the latter in the end (and getting freed by it), have the
latter *begin* with calling the former (and turn the latter into
kernel_thread() callback, of course).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
After both prio_tree users have been converted to use red-black trees,
there is no need to keep around the prio tree library anymore.
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The ACPI BGRT driver accesses the BIOS logo image when it initializes.
However, ACPI 5.0 (which introduces the BGRT) recommends putting the
logo image in EFI boot services memory, so that the OS can reclaim that
memory. Production systems follow this recommendation, breaking the
ACPI BGRT driver.
Move the bulk of the BGRT code to run during a new EFI late
initialization phase, which occurs after switching EFI to virtual mode,
and after initializing ACPI, but before freeing boot services memory.
Copy the BIOS logo image to kernel memory at that point, and make it
accessible to the BGRT driver. Rework the existing ACPI BGRT driver to
act as a simple wrapper exposing that image (and the properties from the
BGRT) via sysfs.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Link: http://lkml.kernel.org/r/93ce9f823f1c1f3bb88bdd662cce08eee7a17f5d.1348876882.git.josh@joshtriplett.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>