getname() is intended to copy pathname strings from userspace into a
kernel buffer. The result is just a string in kernel space. It would
however be quite helpful to be able to attach some ancillary info to
the string.
For instance, we could attach some audit-related info to reduce the
amount of audit-related processing needed. When auditing is enabled,
we could also call getname() on the string more than once and not
need to recopy it from userspace.
This patchset converts the getname()/putname() interfaces to return
a struct instead of a string. For now, the struct just tracks the
string in kernel space and the original userland pointer for it.
Later, we'll add other information to the struct as it becomes
convenient.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When compiling with user namespace support btrfs fails like:
fs/btrfs/tree-log.c: In function ‘fill_inode_item’:
fs/btrfs/tree-log.c:2955:2: error: incompatible type for argument 3 of ‘btrfs_set_inode_uid’
fs/btrfs/ctree.h:2026:1: note: expected ‘u32’ but argument is of type ‘kuid_t’
fs/btrfs/tree-log.c:2956:2: error: incompatible type for argument 3 of ‘btrfs_set_inode_gid’
fs/btrfs/ctree.h:2027:1: note: expected ‘u32’ but argument is of type ‘kgid_t’
Fix this by using i_uid_read and i_gid_read in
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The code needs to be from_kgid(make_kgid(...)...) not
from_kuid(make_kgid(...)...). Doh!
Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
With user namespace support enabled building bluetooth generated the warning.
net/bluetooth/af_bluetooth.c: In function ‘bt_seq_show’:
net/bluetooth/af_bluetooth.c:598:7: warning: format ‘%u’ expects argument of type ‘unsigned int’, but argument 7 has type ‘kuid_t’ [-Wformat]
Convert sock_i_uid from a kuid_t to a uid_t before printing, to avoid
this problem.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Masatake YAMATO <yamato@redhat.com>
Cc: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Use the recently-added bio front_pad field to allocate struct dm_target_io.
Prior to this patch, dm_target_io was allocated from a mempool. For each
dm_target_io, there is exactly one bio allocated from a bioset.
This patch merges these two allocations into one allocation: we create a
bioset with front_pad equal to the size of dm_target_io so that every
bio allocated from the bioset has sizeof(struct dm_target_io) bytes
before it. We allocate a bio and use the bytes before the bio as
dm_target_io.
_tio_cache is removed and the tio_pool mempool is now only used for
request-based devices.
This idea was introduced by Kent Overstreet.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: tj@kernel.org
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Bill Pemberton <wfp5p@viridian.itc.virginia.edu>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The bio prison code will be useful to other future DM targets so
move it to a separate module.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The bio prison code will be useful to share with future DM targets.
Prepare to move this code into a separate module, adding a dm prefix
to structures and functions that will be exported.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Support discards when the pool's block size is not a power of 2.
The block layer assumes discard_granularity is a power of 2 (in
blkdev_issue_discard), so we set this to the largest power of 2 that is
a divides into the number of sectors in each block, but never less than
DATA_DEV_BLOCK_SIZE_MIN_SECTORS.
This patch eliminates the "Discard support must be disabled when the
block size is not a power of 2" constraint that was imposed in commit
55f2b8b ("dm thin: support for non power of 2 pool blocksize"). That
commit was incomplete: using a block size that is not a power of 2
shouldn't mean disabling discard support on the device completely.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
All the called functions expect interrupts to be enabled, and
now one of them has started to warn about it, so make it correct.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
The device had an undocumented "feature": it can provide a sequence of
spurious link-down status data even if the link is up all the time.
A sequence of 10 was seen so update the link state only after the device
reports the same link state 20 times.
Signed-off-by: Ondrej Zary <linux@rainbow-software.org>
Reported-by: Michael Leun <lkml20120218@newton.leun.net>
Tested-by: Michael Leun <lkml20120218@newton.leun.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is not easy to use in4_pton() correctly without reading
its definition, so add some doc for it.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is not easy to use in6_pton() correctly without reading
its definition, so add some doc for it.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use be32_to_cpu instead of htonl to keep sparse happy.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit e2446eaa ("tcp_v4_send_reset: binding oif to iif in no
sock case").. tcp resets are always lost, when routing is asymmetric.
Yes, backing out that patch will result in misrouting of resets for
dead connections which used interface binding when were alive, but we
actually cannot do anything here. What's died that's died and correct
handling normal unbound connections is obviously a priority.
Comment to comment:
> This has few benefits:
> 1. tcp_v6_send_reset already did that.
It was done to route resets for IPv6 link local addresses. It was a
mistake to do so for global addresses. The patch fixes this as well.
Actually, the problem appears to be even more serious than guaranteed
loss of resets. As reported by Sergey Soloviev <sol@eqv.ru>, those
misrouted resets create a lot of arp traffic and huge amount of
unresolved arp entires putting down to knees NAT firewalls which use
asymmetric routing.
Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
* allow kernel_execve() leave the actual return to userland to
caller (selected by CONFIG_GENERIC_KERNEL_EXECVE). Callers
updated accordingly.
* architecture that does select GENERIC_KERNEL_EXECVE in its
Kconfig should have its ret_from_kernel_thread() do this:
call schedule_tail
call the callback left for it by copy_thread(); if it ever
returns, that's because it has just done successful kernel_execve()
jump to return from syscall
IOW, its only difference from ret_from_fork() is that it does call the
callback.
* such an architecture should also get rid of ret_from_kernel_execve()
and __ARCH_WANT_KERNEL_EXECVE
This is the last part of infrastructure patches in that area - from
that point on work on different architectures can live independently.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Convert cpu_to_le32(le32_to_cpu(E1) + E2) to use le32_add_cpu().
dpatch engine is used to auto generate this patch.
(https://github.com/weiyj/dpatch)
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Use the ACCESS_ONCE macro in dm-bufio and dm-verity where a variable
can be modified asynchronously (through sysfs) and we want to prevent
compiler optimizations that assume that the variable hasn't changed.
(See Documentation/atomic_ops.txt.)
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Use list_move() instead of list_del() + list_add().
spatch with a semantic match was used to find this.
(http://coccinelle.lip6.fr/)
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The mpio dereference should be moved below the BUG_ON NULL test
in multipath_end_io().
spatch with a semantic match was used to found this.
(http://coccinelle.lip6.fr/)
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
- Register a pfn_is_ram helper to speed up reading of /proc/vmcore.
Bug-fixes:
- Three pvops call for Xen were undefined causing BUG_ONs.
- Add a quirk so that the shutdown watches (used by kdump) are not used with older Xen (3.4).
- Fix ungraceful state transition for the HVC console.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJQeBKDAAoJEFjIrFwIi8fJDpwH/3nPBH82pJVxdLPBnmJhWuJR
voSPP0m9i69w/mc7wHtiRwK4lRMAUidgS77iBZkIT2cY0/NYvOKKBlMUitkYJFlK
dTVqr9O4iQcuG2yQk8+mXxC6NLH1VKOnSIyhqRswrePoBKzoHi/x7Y462a+tbxa9
lGBHT9/SqeYXyItRfkdfmAXFZcqIJqLRXEwRMvbky1U3s2QGy7CdIQgra0zWF+t1
ashNpaEBpH9Jy60VSpQtMpx8hWxd0W2NirNu+nACtTE5/MeuiBvKlPdEPC/rUbdJ
c5j5VYLjSxPCheY0sajK6pxKgHdfiqmMRlutzMVj3Egwilb0LBxv1018gRFzBu8=
=/qCG
-----END PGP SIGNATURE-----
Merge tag 'stable/for-linus-3.7-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
Pull Xen fixes from Konrad Rzeszutek Wilk:
"This has four bug-fixes and one tiny feature that I forgot to put
initially in my tree due to oversight.
The feature is for kdump kernels to speed up the /proc/vmcore reading.
There is a ram_is_pfn helper function that the different platforms can
register for. We are now doing that.
The bug-fixes cover some embarrassing struct pv_cpu_ops variables
being set to NULL on Xen (but not baremetal). We had a similar issue
in the past with {write|read}_msr_safe and this fills the three
missing ones. The other bug-fix is to make the console output (hvc)
be capable of dealing with misbehaving backends and not fall flat on
its face. Lastly, a quirk for older XenBus implementations that came
with an ancient v3.4 hypervisor (so RHEL5 based) - reading of certain
non-existent attributes just hangs the guest during bootup - so we
take precaution of not doing that on such older installations.
Feature:
- Register a pfn_is_ram helper to speed up reading of /proc/vmcore.
Bug-fixes:
- Three pvops call for Xen were undefined causing BUG_ONs.
- Add a quirk so that the shutdown watches (used by kdump) are not
used with older Xen (3.4).
- Fix ungraceful state transition for the HVC console."
* tag 'stable/for-linus-3.7-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen/pv-on-hvm kexec: add quirk for Xen 3.4 and shutdown watches.
xen/bootup: allow {read|write}_cr8 pvops call.
xen/bootup: allow read_tscp call for Xen PV guests.
xen pv-on-hvm: add pfn_is_ram helper for kdump
xen/hvc: handle backend CLOSED without CLOSING
Pull SLAB fix from Pekka Enberg:
"This contains a lockdep false positive fix from Jiri Kosina I missed
from the previous pull request."
* 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
mm, slab: release slab_mutex earlier in kmem_cache_destroy()
Pull timer core update from Thomas Gleixner:
- Bug fixes (one for a longstanding dead loop issue)
- Rework of time related vsyscalls
- Alarm timer updates
- Jiffies updates to remove compile time dependencies
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Cast raw_interval to u64 to avoid shift overflow
timers: Fix endless looping between cascade() and internal_add_timer()
time/jiffies: bring back unconditional LATCH definition
time: Convert x86_64 to using new update_vsyscall
time: Only do nanosecond rounding on GENERIC_TIME_VSYSCALL_OLD systems
time: Introduce new GENERIC_TIME_VSYSCALL
time: Convert CONFIG_GENERIC_TIME_VSYSCALL to CONFIG_GENERIC_TIME_VSYSCALL_OLD
time: Move update_vsyscall definitions to timekeeper_internal.h
time: Move timekeeper structure to timekeeper_internal.h for vsyscall changes
jiffies: Remove compile time assumptions about CLOCK_TICK_RATE
jiffies: Kill unused TICK_USEC_TO_NSEC
alarmtimer: Rename alarmtimer_remove to alarmtimer_dequeue
alarmtimer: Remove unused helpers & defines
alarmtimer: Use hrtimer per-alarm instead of per-base
alarmtimer: Implement minimum alarm interval for allowing suspend
Pull scheduler fixes from Ingo Molnar:
"A CPU hotplug related crash fix and a nohz accounting fixlet."
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Update sched_domains_numa_masks[][] when new cpus are onlined
sched: Ensure 'sched_domains_numa_levels' is safe to use in other functions
nohz: Fix one jiffy count too far in idle cputime
Pull RCU fixes from Ingo Molnar:
"This tree includes a shutdown/cpu-hotplug deadlock fix and a
documentation fix."
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rcu: Advise most users not to enable RCU user mode
rcu: Grace-period initialization excludes only RCU notifier
The commit 254d1a3f02, titled
"xen/pv-on-hvm kexec: shutdown watches from old kernel" assumes that the
XenBus backend can deal with reading of values from:
"control/platform-feature-xs_reset_watches":
... a patch for xenstored is required so that it
accepts the XS_RESET_WATCHES request from a client (see changeset
23839:42a45baf037d in xen-unstable.hg). Without the patch for xenstored
the registration of watches will fail and some features of a PVonHVM
guest are not available. The guest is still able to boot, but repeated
kexec boots will fail."
Sadly this is not true when using a Xen 3.4 hypervisor and booting a PVHVM
guest. We end up hanging at:
err = xenbus_scanf(XBT_NIL, "control",
"platform-feature-xs_reset_watches", "%d", &supported);
This can easily be seen with guests hanging at xenbus_init:
NX (Execute Disable) protection: active
SMBIOS 2.4 present.
DMI: Xen HVM domU, BIOS 3.4.0 05/13/2011
Hypervisor detected: Xen HVM
Xen version 3.4.
Xen Platform PCI: I/O protocol version 1
... snip ..
calling xenbus_init+0x0/0x27e @ 1
Reverting the commit or using the attached patch fixes the issue. This fix
checks whether the hypervisor is older than 4.0 and if so does not try to
perform the read.
Fixes-Oracle-Bug: 14708233
CC: stable@vger.kernel.org
Acked-by: Olaf Hering <olaf@aepfle.de>
[v2: Added a comment in the source code]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
We actually do not do anything about it. Just return a default
value of zero and if the kernel tries to write anything but 0
we BUG_ON.
This fixes the case when an user tries to suspend the machine
and it blows up in save_processor_state b/c 'read_cr8' is set
to NULL and we get:
kernel BUG at /home/konrad/ssd/linux/arch/x86/include/asm/paravirt.h:100!
invalid opcode: 0000 [#1] SMP
Pid: 2687, comm: init.late Tainted: G O 3.6.0upstream-00002-gac264ac-dirty #4 Bochs Bochs
RIP: e030:[<ffffffff814d5f42>] [<ffffffff814d5f42>] save_processor_state+0x212/0x270
.. snip..
Call Trace:
[<ffffffff810733bf>] do_suspend_lowlevel+0xf/0xac
[<ffffffff8107330c>] ? x86_acpi_suspend_lowlevel+0x10c/0x150
[<ffffffff81342ee2>] acpi_suspend_enter+0x57/0xd5
CC: stable@vger.kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
The hypervisor will trap it. However without this patch,
we would crash as the .read_tscp is set to NULL. This patch
fixes it and sets it to the native_read_tscp call.
CC: stable@vger.kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Dave Jones <davej@redhat.com>
The con_debug_leave/con_debug_enter functions are stubbed out
by defining them to (0), which causes harmless build warnings.
Using proper inline functions is the normal way to deal with
this.
Without this patch, building the ARM bcm2835_defconfig results in:
drivers/tty/serial/kgdboc.c: In function 'kgdboc_pre_exp_handler':
drivers/tty/serial/kgdboc.c:279:3: warning: statement with no effect [-Wunused-value]
drivers/tty/serial/kgdboc.c: In function 'kgdboc_post_exp_handler':
drivers/tty/serial/kgdboc.c:293:3: warning: statement with no effect [-Wunused-value]
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
It is possible to miss data when using the kdb pager. The kdb pager
does not pay attention to the maximum column constraint of the screen
or serial terminal. This result is not incrementing the shown lines
correctly and the pager will print more lines that fit on the screen.
Obviously that is less than useful when using a VGA console where you
cannot scroll back.
The pager will now look at the kdb_buffer string to see how many
characters are printed. It might not be perfect considering you can
output ASCII that might move the cursor position, but it is a
substantially better approximation for viewing dmesg and trace logs.
This also means that the vt screen needs to set the kdb COLUMNS
variable.
Cc: <stable@vger.kernel.org>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
If you press 'q' the pager should exit instead of printing everything
from dmesg which can really bog down a 9600 baud serial link.
The same is true for the bta command.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
It is a common enough mistake for people to specify "kdb" when they
meant to type "kbd" that the kgdboc can just accept both since they
both mean the same thing anyway. Specifically it is for the case
where you want kdb to be active on your graphics console + keyboard
(where kbd was the original abbreviation for keyboard).
With this change kgdboc will now accept either to mean the same thing:
kgdboc=kbd
kgdboc=kdb
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
When compiling without CONFIG_DEBUG_RODATA the following
compiler warning is generated:
arch/x86/kernel/kgdb.c: In function 'kgdb_arch_set_breakpoint':
arch/x86/kernel/kgdb.c:749: warning: unused variable 'opc'
The variable instantiation needs to be inside the #ifdef to
make the warning go away.
Reported-by: Thiago Rafael Becker <trbecker@trbecker.org>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
This fault was detected using the kgdb test suite on boot and it
crashes recursively due to the fact that CONFIG_KPROBES on mips adds
an extra die notifier in the page fault handler. The crash signature
looks like this:
kgdbts:RUN bad memory access test
KGDB: re-enter exception: ALL breakpoints killed
Call Trace:
[<807b7548>] dump_stack+0x20/0x54
[<807b7548>] dump_stack+0x20/0x54
The fix for now is to have kgdb return immediately if the fault type
is DIE_PAGE_FAULT and allow the kprobe code to decide what is supposed
to happen.
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Allow gdb to auto load kernel modules when it is attached,
which makes it trivially easy to debug module init functions
or pre-set breakpoints in a kernel module that has not loaded yet.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
First, it's incorrect to call putname() after __getname_gfp() since the
bare __getname_gfp() call skips the auditing code, while putname()
doesn't.
mount_block_root allocates a PATH_MAX buffer via __getname_gfp, and then
calls get_fs_names to fill the buffer. That function can call
get_filesystem_list which assumes that that buffer is a full page in
size. On arches where PAGE_SIZE != 4k, then this could potentially
overrun.
In practice, it's hard to imagine the list of filesystem names even
approaching 4k, but it's best to be safe. Just allocate a page for this
purpose instead.
With this, we can also remove the __getname_gfp() definition since there
are no more callers.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In order to accomodate retrying path-based syscalls, we need to add a
new "type" argument to audit_inode_child. This will tell us whether
we're looking for a child entry that represents a create or a delete.
If we find a parent, don't automatically assume that we need to create a
new entry. Instead, use the information we have to try to find an
existing entry first. Update it if one is found and create a new one if
not.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In the cases where we already know the length of the parent, pass it as
a parm so we don't need to recompute it. In the cases where we don't
know the length, pass in AUDIT_NAME_FULL (-1) to indicate that it should
be determined.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently, this gets set mostly by happenstance when we call into
audit_inode_child. While that might be a little more efficient, it seems
wrong. If the syscall ends up failing before audit_inode_child ever gets
called, then you'll have an audit_names record that shows the full path
but has the parent inode info attached.
Fix this by passing in a parent flag when we call audit_inode that gets
set to the value of LOOKUP_PARENT. We can then fix up the pathname for
the audit entry correctly from the get-go.
While we're at it, clean up the no-op macro for audit_inode in the
!CONFIG_AUDITSYSCALL case.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
For now, we just have two possibilities:
UNKNOWN: for a new audit_names record that we don't know anything about yet
NORMAL: for everything else
In later patches, we'll add other types so we can distinguish and update
records created under different circumstances.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Most of the callers get called with an inode and dentry in the reverse
order. The compiler then has to reshuffle the arg registers and/or
stack in order to pass them on to audit_inode_child.
Reverse those arguments for a micro-optimization.
Reported-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>