Impact: clean up
Now that a generic in_nmi is available, this patch removes the
special code in the ring_buffer and implements the in_nmi generic
version instead.
With this change, I was also able to rename the "arch_ftrace_nmi_enter"
back to "ftrace_nmi_enter" and remove the code from the ring buffer.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: prevent deadlock in NMI
The ring buffers are not yet totally lockless with writing to
the buffer. When a writer crosses a page, it grabs a per cpu spinlock
to protect against a reader. The spinlocks taken by a writer are not
to protect against other writers, since a writer can only write to
its own per cpu buffer. The spinlocks protect against readers that
can touch any cpu buffer. The writers are made to be reentrant
with the spinlocks disabling interrupts.
The problem arises when an NMI writes to the buffer, and that write
crosses a page boundary. If it grabs a spinlock, it can be racing
with another writer (since disabling interrupts does not protect
against NMIs) or with a reader on the same CPU. Luckily, most of the
users are not reentrant and protects against this issue. But if a
user of the ring buffer becomes reentrant (which is what the ring
buffers do allow), if the NMI also writes to the ring buffer then
we risk the chance of a deadlock.
This patch moves the ftrace_nmi_enter called by nmi_enter() to the
ring buffer code. It replaces the current ftrace_nmi_enter that is
used by arch specific code to arch_ftrace_nmi_enter and updates
the Kconfig to handle it.
When an NMI is called, it will set a per cpu variable in the ring buffer
code and will clear it when the NMI exits. If a write to the ring buffer
crosses page boundaries inside an NMI, a trylock is used on the spin
lock instead. If the spinlock fails to be acquired, then the entry
is discarded.
This bug appeared in the ftrace work in the RT tree, where event tracing
is reentrant. This workaround solved the deadlocks that appeared there.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix bad times of recent resets
The ring buffer needs to reset its timestamps when reseting of the
buffer, otherwise the timestamps are stale and might be used to
calculate times in the buffer causing funny timestamps to appear.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix bad times of recent resets
The ring buffer needs to reset its timestamps when reseting of the
buffer, otherwise the timestamps are stale and might be used to
calculate times in the buffer causing funny timestamps to appear.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If the ring buffer recording has been disabled. Do not let
swapping of ring buffers occur. Simply return -EAGAIN.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: reset struct buffer_page.write when interrupt storm
if struct buffer_page.write is not reset, any succedent committing
will corrupted ring_buffer:
static inline void
rb_set_commit_to_write(struct ring_buffer_per_cpu *cpu_buffer)
{
......
cpu_buffer->commit_page->commit =
cpu_buffer->commit_page->write;
......
}
when "if (RB_WARN_ON(cpu_buffer, next_page == reader_page))", ring_buffer
is disabled, but some reserved buffers may haven't been committed.
we need reset struct buffer_page.write.
when "if (unlikely(next_page == cpu_buffer->commit_page))", ring_buffer
is still available, we should not corrupt it.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix to allow some archs to use the ring buffer
Commits in the ring buffer are checked by pointer arithmetic.
If the calculation is incorrect, then the commits will never take
place and the buffer will simply fill up and report an error.
Each page in the ring buffer has a small header:
struct buffer_data_page {
u64 time_stamp;
local_t commit;
unsigned char data[];
};
Unfortuntely, some of the calculations used sizeof(struct buffer_data_page)
to know the size of the header. But this is incorrect on some archs,
where sizeof(struct buffer_data_page) does not equal
offsetof(struct buffer_data_page, data), and on those archs, the commits
are never processed.
This patch replaces the sizeof with offsetof.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: reset struct buffer_page.write when interrupt storm
if struct buffer_page.write is not reset, any succedent committing
will corrupted ring_buffer:
static inline void
rb_set_commit_to_write(struct ring_buffer_per_cpu *cpu_buffer)
{
......
cpu_buffer->commit_page->commit =
cpu_buffer->commit_page->write;
......
}
when "if (RB_WARN_ON(cpu_buffer, next_page == reader_page))", ring_buffer
is disabled, but some reserved buffers may haven't been committed.
we need reset struct buffer_page.write.
when "if (unlikely(next_page == cpu_buffer->commit_page))", ring_buffer
is still available, we should not corrupt it.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rric/oprofile: (31 commits)
powerpc/oprofile: fix whitespaces in op_model_cell.c
powerpc/oprofile: IBM CELL: add SPU event profiling support
powerpc/oprofile: fix cell/pr_util.h
powerpc/oprofile: IBM CELL: cleanup and restructuring
oprofile: make new cpu buffer functions part of the api
oprofile: remove #ifdef CONFIG_OPROFILE_IBS in non-ibs code
ring_buffer: fix ring_buffer_event_length()
oprofile: use new data sample format for ibs
oprofile: add op_cpu_buffer_get_data()
oprofile: add op_cpu_buffer_add_data()
oprofile: rework implementation of cpu buffer events
oprofile: modify op_cpu_buffer_read_entry()
oprofile: add op_cpu_buffer_write_reserve()
oprofile: rename variables in add_ibs_begin()
oprofile: rename add_sample() in cpu_buffer.c
oprofile: rename variable ibs_allowed to has_ibs in op_model_amd.c
oprofile: making add_sample_entry() inline
oprofile: remove backtrace code for ibs
oprofile: remove unused ibs macro
oprofile: remove unused components in struct oprofile_cpu_buffer
...
Function ring_buffer_event_length() provides an interface to detect
the length of data stored in an entry. However, the length contains
offsets depending on the internal usage. This makes it unusable. This
patch fixes this and now ring_buffer_event_length() returns the
alligned length that has been used in ring_buffer_lock_reserve().
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Robert Richter <robert.richter@amd.com>
Impact: Reduce future memory usage, use new cpumask API.
(Eventually, cpumask_var_t will be allocated based on nr_cpu_ids, not NR_CPUS).
Convert kernel trace functions to use struct cpumask API:
1) Use cpumask_copy/cpumask_test_cpu/for_each_cpu.
2) Use cpumask_var_t and alloc_cpumask_var/free_cpumask_var everywhere.
3) Use on_each_cpu instead of playing with current->cpus_allowed.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
* 'oprofile-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
oprofile: select RING_BUFFER
ring_buffer: adding EXPORT_SYMBOLs
oprofile: fix lost sample counter
oprofile: remove nr_available_slots()
oprofile: port to the new ring_buffer
ring_buffer: add remaining cpu functions to ring_buffer.h
oprofile: moving cpu_buffer_reset() to cpu_buffer.h
oprofile: adding cpu_buffer_entries()
oprofile: adding cpu_buffer_write_commit()
oprofile: adding cpu buffer r/w access functions
ftrace: remove unused function arg in trace_iterator_increment()
ring_buffer: update description for ring_buffer_alloc()
oprofile: set values to default when creating oprofilefs
oprofile: implement switch/case in buffer_sync.c
x86/oprofile: cleanup IBS init/exit functions in op_model_amd.c
x86/oprofile: reordering IBS code in op_model_amd.c
oprofile: fix typo
oprofile: whitspace changes only
oprofile: update comment for oprofile_add_sample()
oprofile: comment cleanup
Impact: eliminate false WARN_ON message
If an interrupt goes off after the setting of the local variable
tail_page and before incrementing the write index of that page,
the interrupt could push the commit forward to the next page.
Later a check is made to see if interrupts pushed the buffer around
the entire ring buffer by comparing the next page to the last commited
page. This can produce a false positive if the interrupt had pushed
the commit page forward as stated above.
Thanks to Jiaying Zhang for finding this race.
Reported-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix stuck trace-buffers
If an interrupt comes in during the rb_set_commit_to_write and
pushes the tail page forward just at the right time, the commit
updates will miss the adding of the interrupt data. This will
cause the commit pointer to cease from moving forward.
Thanks to Jiaying Zhang for finding this race.
Reported-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: remove dead code
struct ring_buffer.size is not set after ring_buffer is initialized
or resized. it is always 0.
we can use "buffer->pages * PAGE_SIZE" to get ring_buffer's size
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: prevent a trace recursion
After some tests with function graph tracer under x86-32, I saw some recursions
caused by ring_buffer_time_stamp() that calls preempt_enable_no_notrace() which
calls preempt_schedule() which is traced itself.
This patch re-enables preemption without rescheduling.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
I added EXPORT_SYMBOL_GPLs for all functions part of the API
(ring_buffer.h). This is required since oprofile is using the ring
buffer and the compilation as modules would fail otherwise.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: clean up
Andrew Morton pointed out that the kernel convention of a variable
named page should be of type page struct. The ring buffer uses
a variable named "page" for a pointer to something else.
This patch converts those to be called "bpage" (as in "buffer page").
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new API to ring buffer
This patch adds a new interface into the ring buffer that allows a
page to be read from the ring buffer on a given CPU. For every page
read, one must also be given to allow for a "swap" of the pages.
rpage = ring_buffer_alloc_read_page(buffer);
if (!rpage)
goto err;
ret = ring_buffer_read_page(buffer, &rpage, cpu, full);
if (!ret)
goto empty;
process_page(rpage);
ring_buffer_free_read_page(rpage);
The caller of these functions must handle any waits that are
needed to wait for new data. The ring_buffer_read_page will simply
return 0 if there is no data, or if "full" is set and the writer
is still on the current page.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: get ready for splice changes
This patch moves the commit and timestamp into the beginning of each
data page of the buffer. This change will allow the page to be moved
to another location (disk, network, etc) and still have information
in the page to be able to read it.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: prevent unnecessary stack recursion
if the resched flag was set before we entered, then don't reschedule.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: feature to permanently disable ring buffer
This patch adds a API to the ring buffer code that will permanently
disable the ring buffer from ever recording. This should only be
called when some serious anomaly is detected, and the system
may be in an unstable state. When that happens, shutting down the
recording to the ring buffers may be appropriate.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix tracing buffer mutex leak in case of allocation failure
This error was spotted by this semantic patch:
http://www.emn.fr/x-info/coccinelle/mut.html
It looks correct as far as I can tell. Please review.
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Pekka reported a crash when resizing the mmiotrace tracer (if only
mmiotrace is enabled).
This happens because in that case we do not allocate the max buffer,
but we try to use it.
Make ring_buffer_resize() idempotent against NULL buffers.
Reported-by: Pekka Paalanen <pq@iki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: deadlock fix in ring_buffer_read_start
The ring_buffer_iter_reset was called from ring_buffer_read_start
where both grabbed the reader_lock.
This patch separates out the internals of ring_buffer_iter_reset
to its own function so that both APIs may grab the reader_lock.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: disable preemption when calling sched_clock()
The ring_buffer_time_stamp still uses sched_clock as its counter.
But it is a bug to call it with preemption enabled. This requirement
should not be pushed to the ring_buffer_time_stamp callers, so
the ring_buffer_time_stamp needs to disable preemption when calling
sched_clock.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Restructure WARN_ONs in ring_buffer.c
The current WARN_ON macros in ring_buffer.c are quite ugly.
This patch cleans them up and uses a single RB_WARN_ON that returns
the value of the condition. This allows the caller to abort the
function if the condition is true.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: enable/disable ring buffer recording API added
Several kernel developers have requested that there be a way to stop
recording into the ring buffers with a simple switch that can also
be enabled from userspace. This patch addes a new kernel API to the
ring buffers called:
tracing_on()
tracing_off()
When tracing_off() is called, all ring buffers will not be able to record
into their buffers.
tracing_on() will enable the ring buffers again.
These two act like an on/off switch. That is, there is no counting of the
number of times tracing_off or tracing_on has been called.
A new file is added to the debugfs/tracing directory called
tracing_on
This allows for userspace applications to also flip the switch.
echo 0 > debugfs/tracing/tracing_on
disables the tracing.
echo 1 > /debugfs/tracing/tracing_on
enables it.
Note, this does not disable or enable any tracers. It only sets or clears
a flag that needs to be set in order for the ring buffers to write to
their buffers. It is a global flag, and affects all ring buffers.
The buffers start out with tracing_on enabled.
There are now three flags that control recording into the buffers:
tracing_on: which affects all ring buffer tracers.
buffer->record_disabled: which affects an allocated buffer, which may be set
if an anomaly is detected, and tracing is disabled.
cpu_buffer->record_disabled: which is set by tracing_stop() or if an
anomaly is detected. tracing_start can not reenable this if
an anomaly occurred.
The userspace debugfs/tracing/tracing_enabled is implemented with
tracing_stop() but the user space code can not enable it if the kernel
called tracing_stop().
Userspace can enable the tracing_on even if the kernel disabled it.
It is just a switch used to stop tracing if a condition was hit.
tracing_on is not for protecting critical areas in the kernel nor is
it for stopping tracing if an anomaly occurred. This is because userspace
can reenable it at any time.
Side effect: With this patch, I discovered a dead variable in ftrace.c
called tracing_on. This patch removes it.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: serialize reader accesses to individual CPU ring buffers
The code in the ring buffer expects only one reader at a time, but currently
it puts that requirement on the caller. This is not strong enough, and this
patch adds a "reader_lock" that serializes the access to the reader API
of the ring buffer.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch replaces most of the BUG_ONs in the ring_buffer code with
RB_WARN_ON variants. It adds some more variants as needed for the
replacement. This lets the buffer die nicely and still warn the user.
One BUG_ON remains in the code, and that is because it detects a
bad pointer passed in by the calling function, and not a bug by
the ring buffer code itself.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: removal of unnecessary looping
The lockless part of the ring buffer allows for reentry into the code
from interrupts. A timestamp is taken, a test is preformed and if it
detects that an interrupt occurred that did tracing, it tries again.
The problem arises if the timestamp code itself causes a trace.
The detection will detect this and loop again. The difference between
this and an interrupt doing tracing, is that this will fail every time,
and cause an infinite loop.
Currently, we test if the loop happens 1000 times, and if so, it will
produce a warning and disable the ring buffer.
The problem with this approach is that it makes it difficult to perform
some types of tracing (tracing the timestamp code itself).
Each trace entry has a delta timestamp from the previous entry.
If a trace entry is reserved but and interrupt occurs and traces before
the previous entry is commited, the delta timestamp for that entry will
be zero. This actually makes sense in terms of tracing, because the
interrupt entry happened before the preempted entry was commited, so
one may consider the two happening at the same time. The order is
still preserved in the buffer.
With this idea, instead of trying to get a new timestamp if an interrupt
made it in between the timestamp and the test, the entry could simply
make the delta zero and continue. This will prevent interrupts or
tracers in the timer code from causing the above loop.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: no lockdep debugging of ring buffer
The problem with running lockdep on the ring buffer is that the
ring buffer is the core infrastructure of ftrace. What happens is
that the tracer will start tracing the lockdep code while lockdep
is testing the ring buffers locks. This can cause lockdep to
fail due to testing cases that have not fully finished their
locking transition.
This patch converts the spin locks used by the ring buffer back
into raw spin locks which lockdep does not check.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: use new, consolidated APIs in ftrace plugins
This patch replaces the schedule safe preempt disable code with the
ftrace_preempt_disable() and ftrace_preempt_enable() safe functions.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
While writing a new tracer, I had a bug where I caused the ring-buffer
to recurse in a bad way. The bug was with the tracer I was writing
and not the ring-buffer itself. But it took a long time to find the
problem.
This patch adds paranoid checks into the ring-buffer infrastructure
that will catch bugs of this nature.
Note: I put the bug back in the tracer and this patch showed the error
nicely and prevented the lockup.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
A powerpc ppc64_defconfig build produces these warnings:
kernel/trace/ring_buffer.c: In function 'rb_add_time_stamp':
kernel/trace/ring_buffer.c:969: warning: format '%llu' expects type 'long long unsigned int', but argument 2 has type 'u64'
kernel/trace/ring_buffer.c:969: warning: format '%llu' expects type 'long long unsigned int', but argument 3 has type 'u64'
kernel/trace/ring_buffer.c:969: warning: format '%llu' expects type 'long long unsigned int', but argument 4 has type 'u64'
Just cast the u64s to unsigned long long like we do everywhere else.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The pages of a buffer was originally pointing to the page struct, it
now points to the page address. The freeing of the page still uses
the page frame free "__free_page" instead of the correct free_page to
the address.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch replaces the local_irq_save/restore with preempt_disable/
enable. This allows for interrupts to enter while recording.
To write to the ring buffer, you must reserve data, and then
commit it. During this time, an interrupt may call a trace function
that will also record into the buffer before the commit is made.
The interrupt will reserve its entry after the first entry, even
though the first entry did not finish yet.
The time stamp delta of the interrupt entry will be zero, since
in the view of the trace, the interrupt happened during the
first field anyway.
Locking still takes place when the tail/write moves from one page
to the next. The reader always takes the locks.
A new page pointer is added, called the commit. The write/tail will
always point to the end of all entries. The commit field will
point to the last committed entry. Only this commit entry may
update the write time stamp.
The reader can only go up to the commit. It cannot go past it.
If a lot of interrupts come in during a commit that fills up the
buffer, and it happens to make it all the way around the buffer
back to the commit, then a warning is printed and new events will
be dropped.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove the global head and tail indexes and move them into the
page header. Each page will now keep track of where the last
write and read was made. We also rename the head and tail to read
and write for better clarification.
This patch is needed for future enhancements to move the ring buffer
to a lockless solution.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>