Make bpf_sk_lookup_tcp, bpf_sk_lookup_udp and bpf_sk_release helpers
available in programs of type BPF_PROG_TYPE_CGROUP_SOCK_ADDR.
Such programs operate on sockets and have access to socket and struct
sockaddr passed by user to system calls such as sys_bind, sys_connect,
sys_sendmsg.
It's useful to be able to lookup other sockets from these programs.
E.g. sys_connect may lookup IP:port endpoint and if there is a server
socket bound to that endpoint ("server" can be defined by saddr & sport
being zero), redirect client connection to it by rewriting IP:port in
sockaddr passed to sys_connect.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Lookup functions in sk_lookup have different expectations about byte
order of provided arguments.
Specifically __inet_lookup, __udp4_lib_lookup and __udp6_lib_lookup
expect dport to be in network byte order and do ntohs(dport) internally.
At the same time __inet6_lookup expects dport to be in host byte order
and correspondingly name the argument hnum.
sk_lookup works correctly with __inet_lookup, __udp4_lib_lookup and
__inet6_lookup with regard to dport. But in __udp6_lib_lookup case it
uses host instead of expected network byte order. It makes result
returned by bpf_sk_lookup_udp for IPv6 incorrect.
The patch fixes byte order of dport passed to __udp6_lib_lookup.
Originally sk_lookup properly handled UDPv6, but not TCPv6. 5ef0ae84f0
fixes TCPv6 but breaks UDPv6.
Fixes: 5ef0ae84f0 ("bpf: Fix IPv6 dport byte-order in bpf_sk_lookup")
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch proposes to extend the sk_lookup() BPF API to the
XDP hookpoint. The sk_lookup() helper supports a lookup
on incoming packet to find the corresponding socket that will
receive this packet. Current support for this BPF API is
at the tc hookpoint. This patch will extend this API at XDP
hookpoint. A XDP program can map the incoming packet to the
5-tuple parameter and invoke the API to find the corresponding
socket structure.
Signed-off-by: Nitin Hande <Nitin.Hande@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch allows eBPF programs that use sock_ops to send perf
based event notifications using bpf_perf_event_output(). Our main
use case for this is the following:
We would like to monitor some subset of TCP sockets in user-space,
(the monitoring application would define 4-tuples it wants to monitor)
using TCP_INFO stats to analyze reported problems. The idea is to
use those stats to see where the bottlenecks are likely to be ("is
it application-limited?" or "is there evidence of BufferBloat in
the path?" etc).
Today we can do this by periodically polling for tcp_info, but this
could be made more efficient if the kernel would asynchronously
notify the application via tcp_info when some "interesting"
thresholds (e.g., "RTT variance > X", or "total_retrans > Y" etc)
are reached. And to make this effective, it is better if
we could apply the threshold check *before* constructing the
tcp_info netlink notification, so that we don't waste resources
constructing notifications that will be discarded by the filter.
This work solves the problem by adding perf event based notification
support for sock_ops. The eBPF program can thus be designed to apply
any desired filters to the bpf_sock_ops and trigger a perf event
notification based on the evaluation from the filter. The user space
component can use these perf event notifications to either read any
state managed by the eBPF program, or issue a TCP_INFO netlink call
if desired.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add extack arg to the nla_parse_nested calls in rtnl_newlink, and
add messages for unknown device type and link network namespace id.
In particular, it improves the failure message when the wrong link
type is used. From
$ ip li add bond1 type bonding
RTNETLINK answers: Operation not supported
to
$ ip li add bond1 type bonding
Error: Unknown device type.
(The module name is bonding but the link type is bond.)
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add extack argument to ip_fib_metrics_init and add messages for invalid
metrics.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add extack arg to rtnl_create_link and add messages for invalid
number of Tx or Rx queues.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6_gro_receive() compares 34 bytes using slow memcmp(),
while handcoding with a couple of ipv6_addr_equal() is much faster.
Before this patch, "perf top -e cycles:pp -C <cpu>" would
see memcmp() using ~10% of cpu cycles on a 40Gbit NIC
receiving IPv6 TCP traffic.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Handle errors mid-stream of an all dump, from Alexey Kodanev.
2) Fix build of openvswitch with certain combinations of netfilter
options, from Arnd Bergmann.
3) Fix interactions between GSO and BQL, from Eric Dumazet.
4) Don't put a '/' in RTL8201F's sysfs file name, from Holger
Hoffstätte.
5) S390 qeth driver fixes from Julian Wiedmann.
6) Allow ipv6 link local addresses for netconsole when both source and
destination are link local, from Matwey V. Kornilov.
7) Fix the BPF program address seen in /proc/kallsyms, from Song Liu.
8) Initialize mutex before use in dsa microchip driver, from Tristram
Ha.
9) Out-of-bounds access in hns3, from Yunsheng Lin.
10) Various netfilter fixes from Stefano Brivio, Jozsef Kadlecsik, Jiri
Slaby, Florian Westphal, Eric Westbrook, Andrey Ryabinin, and Pablo
Neira Ayuso.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (50 commits)
net: alx: make alx_drv_name static
net: bpfilter: fix iptables failure if bpfilter_umh is disabled
sock_diag: fix autoloading of the raw_diag module
net: core: netpoll: Enable netconsole IPv6 link local address
ipv6: properly check return value in inet6_dump_all()
rtnetlink: restore handling of dumpit return value in rtnl_dump_all()
net/ipv6: Move anycast init/cleanup functions out of CONFIG_PROC_FS
bonding/802.3ad: fix link_failure_count tracking
net: phy: realtek: fix RTL8201F sysfs name
sctp: define SCTP_SS_DEFAULT for Stream schedulers
sctp: fix strchange_flags name for Stream Change Event
mlxsw: spectrum: Fix IP2ME CPU policer configuration
openvswitch: fix linking without CONFIG_NF_CONNTRACK_LABELS
qed: fix link config error handling
net: hns3: Fix for out-of-bounds access when setting pfc back pressure
net/mlx4_en: use __netdev_tx_sent_queue()
net: do not abort bulk send on BQL status
net: bql: add __netdev_tx_sent_queue()
s390/qeth: report 25Gbit link speed
s390/qeth: sanitize ARP requests
...
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains the first batch of Netfilter fixes for
your net tree:
1) Fix splat with IPv6 defragmenting locally generated fragments,
from Florian Westphal.
2) Fix Incorrect check for missing attribute in nft_osf.
3) Missing INT_MIN & INT_MAX definition for netfilter bridge uapi
header, from Jiri Slaby.
4) Revert map lookup in nft_numgen, this is already possible with
the existing infrastructure without this extension.
5) Fix wrong listing of set reference counter, make counter
synchronous again, from Stefano Brivio.
6) Fix CIDR 0 in hash:net,port,net, from Eric Westbrook.
7) Fix allocation failure with large set, use kvcalloc().
From Andrey Ryabinin.
8) No need to disable BH when fetch ip set comment, patch from
Jozsef Kadlecsik.
9) Sanity check for valid sysfs entry in xt_IDLETIMER, from
Taehee Yoo.
10) Fix suspicious rcu usage via ip_set() macro at netlink dump,
from Jozsef Kadlecsik.
11) Fix setting default timeout via nfnetlink_cttimeout, this
comes with preparation patch to add nf_{tcp,udp,...}_pernet()
helper.
12) Allow ebtables table nat to be of filter type via nft_compat.
From Florian Westphal.
13) Incorrect calculation of next bucket in early_drop, do no bump
hash value, update bucket counter instead. From Vasily Khoruzhick.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When iptables command is executed, ip_{set/get}sockopt() try to upload
bpfilter.ko if bpfilter is enabled. if it couldn't find bpfilter.ko,
command is failed.
bpfilter.ko is generated if CONFIG_BPFILTER_UMH is enabled.
ip_{set/get}sockopt() only checks CONFIG_BPFILTER.
So that if CONFIG_BPFILTER is enabled and CONFIG_BPFILTER_UMH is disabled,
iptables command is always failed.
test config:
CONFIG_BPFILTER=y
# CONFIG_BPFILTER_UMH is not set
test command:
%iptables -L
iptables: No chain/target/match by that name.
Fixes: d2ba09c17a ("net: add skeleton of bpfilter kernel module")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPPROTO_RAW isn't registred as an inet protocol, so
inet_protos[protocol] is always NULL for it.
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Xin Long <lucien.xin@gmail.com>
Fixes: bf2ae2e4bf ("sock_diag: request _diag module only when the family or proto has been registered")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no reason to discard using source link local address when
remote netconsole IPv6 address is set to be link local one.
The patch allows administrators to use IPv6 netconsole without
explicitly configuring source address:
netconsole=@/,@fe80::5054:ff:fe2f:6012/
Signed-off-by: Matwey V. Kornilov <matwey@sai.msu.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make sure we call fib6_dump_end() if it happens that skb->len
is zero. rtnl_dump_all() can reset cb->args on the next loop
iteration there.
Fixes: 08e814c9e8 ("net/ipv6: Bail early if user only wants cloned entries")
Fixes: ae677bbb44 ("net: Don't return invalid table id error when dumping all families")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For non-zero return from dumpit() we should break the loop
in rtnl_dump_all() and return the result. Otherwise, e.g.,
we could get the memory leak in inet6_dump_fib() [1]. The
pointer to the allocated struct fib6_walker there (saved
in cb->args) can be lost, reset on the next iteration.
Fix it by partially restoring the previous behavior before
commit c63586dc9b ("net: rtnl_dump_all needs to propagate
error from dumpit function"). The returned error from
dumpit() is still passed further.
[1]:
unreferenced object 0xffff88001322a200 (size 96):
comm "sshd", pid 1484, jiffies 4296032768 (age 1432.542s)
hex dump (first 32 bytes):
00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de ................
18 09 41 36 00 88 ff ff 18 09 41 36 00 88 ff ff ..A6......A6....
backtrace:
[<0000000095846b39>] kmem_cache_alloc_trace+0x151/0x220
[<000000007d12709f>] inet6_dump_fib+0x68d/0x940
[<000000002775a316>] rtnl_dump_all+0x1d9/0x2d0
[<00000000d7cd302b>] netlink_dump+0x945/0x11a0
[<000000002f43485f>] __netlink_dump_start+0x55d/0x800
[<00000000f76bbeec>] rtnetlink_rcv_msg+0x4fa/0xa00
[<000000009b5761f3>] netlink_rcv_skb+0x29c/0x420
[<0000000087a1dae1>] rtnetlink_rcv+0x15/0x20
[<00000000691b703b>] netlink_unicast+0x4e3/0x6c0
[<00000000b5be0204>] netlink_sendmsg+0x7f2/0xba0
[<0000000096d2aa60>] sock_sendmsg+0xba/0xf0
[<000000008c1b786f>] __sys_sendto+0x1e4/0x330
[<0000000019587b3f>] __x64_sys_sendto+0xe1/0x1a0
[<00000000071f4d56>] do_syscall_64+0x9f/0x300
[<000000002737577f>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0000000057587684>] 0xffffffffffffffff
Fixes: c63586dc9b ("net: rtnl_dump_all needs to propagate error from dumpit function")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the anycast.c init and cleanup functions which were inadvertently
added inside the CONFIG_PROC_FS definition.
Fixes: 2384d02520 ("net/ipv6: Add anycast addresses to a global hashtable")
Signed-off-by: Jeff Barnhill <0xeffeff@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to rfc8260#section-4.3.2, SCTP_SS_DEFAULT is required to
defined as SCTP_SS_FCFS or SCTP_SS_RR.
SCTP_SS_FCFS is used for SCTP_SS_DEFAULT's value in this patch.
Fixes: 5bbbbe32a4 ("sctp: introduce stream scheduler foundations")
Reported-by: Jianwen Ji <jiji@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When CONFIG_CC_OPTIMIZE_FOR_DEBUGGING is enabled, the compiler
fails to optimize out a dead code path, which leads to a link failure:
net/openvswitch/conntrack.o: In function `ovs_ct_set_labels':
conntrack.c:(.text+0x2e60): undefined reference to `nf_connlabels_replace'
In this configuration, we can take a shortcut, and completely
remove the contrack label code. This may also help the regular
optimization.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull x86 fixes from Ingo Molnar:
"A number of fixes and some late updates:
- make in_compat_syscall() behavior on x86-32 similar to other
platforms, this touches a number of generic files but is not
intended to impact non-x86 platforms.
- objtool fixes
- PAT preemption fix
- paravirt fixes/cleanups
- cpufeatures updates for new instructions
- earlyprintk quirk
- make microcode version in sysfs world-readable (it is already
world-readable in procfs)
- minor cleanups and fixes"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
compat: Cleanup in_compat_syscall() callers
x86/compat: Adjust in_compat_syscall() to generic code under !COMPAT
objtool: Support GCC 9 cold subfunction naming scheme
x86/numa_emulation: Fix uniform-split numa emulation
x86/paravirt: Remove unused _paravirt_ident_32
x86/mm/pat: Disable preemption around __flush_tlb_all()
x86/paravirt: Remove GPL from pv_ops export
x86/traps: Use format string with panic() call
x86: Clean up 'sizeof x' => 'sizeof(x)'
x86/cpufeatures: Enumerate MOVDIR64B instruction
x86/cpufeatures: Enumerate MOVDIRI instruction
x86/earlyprintk: Add a force option for pciserial device
objtool: Support per-function rodata sections
x86/microcode: Make revision and processor flags world-readable
Before calling dev_hard_start_xmit(), upper layers tried
to cook optimal skb list based on BQL budget.
Problem is that GSO packets can end up comsuming more than
the BQL budget.
Breaking the loop is not useful, since requeued packets
are ahead of any packets still in the qdisc.
It is also more expensive, since next TX completion will
push these packets later, while skbs are not in cpu caches.
It is also a behavior difference with TSO packets, that can
break the BQL limit by a large amount.
Note that drivers should use __netdev_tx_sent_queue()
in order to have optimal xmit_more support, and avoid
useless atomic operations as shown in the following patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull 9p fix from Al Viro:
"Regression fix for net/9p handling of iov_iter; broken by braino when
switching to iov_iter_is_kvec() et.al., spotted and fixed by Marc"
* 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
iov_iter: Fix 9p virtio breakage
If there's no entry to drop in bucket that corresponds to the hash,
early_drop() should look for it in other buckets. But since it increments
hash instead of bucket number, it actually looks in the same bucket 8
times: hsize is 16k by default (14 bits) and hash is 32-bit value, so
reciprocal_scale(hash, hsize) returns the same value for hash..hash+7 in
most cases.
Fix it by increasing bucket number instead of hash and rename _hash
to bucket to avoid future confusion.
Fixes: 3e86638e9a ("netfilter: conntrack: consider ct netns in early_drop logic")
Cc: <stable@vger.kernel.org> # v4.7+
Signed-off-by: Vasily Khoruzhick <vasilykh@arista.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Unlike ip(6)tables, the ebtables nat table has no special properties.
This bug causes 'ebtables -A' to fail when using a target such as
'snat' (ebt_snat target sets ".table = "nat"'). Targets that have
no table restrictions work fine.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Otherwise, we hit a NULL pointer deference since handlers always assume
default timeout policy is passed.
netlink: 24 bytes leftover after parsing attributes in process `syz-executor2'.
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 9575 Comm: syz-executor1 Not tainted 4.19.0+ #312
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:icmp_timeout_obj_to_nlattr+0x77/0x170 net/netfilter/nf_conntrack_proto_icmp.c:297
Fixes: c779e84960 ("netfilter: conntrack: remove get_timeout() indirection")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The ip_set() macro is called when either ip_set_ref_lock held only
or no lock/nfnl mutex is held at dumping. Take this into account
properly. Also, use Pablo's suggestion to use rcu_dereference_raw(),
the ref_netlink protects the set.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If the network becomes (partially) unavailable, say by disabling IPv6, the
background ACK transmission routine can get itself into a tizzy by
proposing immediate ACK retransmission. Since we're in the call event
processor, that happens immediately without returning to the workqueue
manager.
The condition should clear after a while when either the network comes back
or the call times out.
Fix this by:
(1) When re-proposing an ACK on failed Tx, don't schedule it immediately.
This will allow a certain amount of time to elapse before we try
again.
(2) Enforce a return to the workqueue manager after a certain number of
iterations of the call processing loop.
(3) Add a backoff delay that increases the delay on deferred ACKs by a
jiffy per failed transmission to a limit of HZ. The backoff delay is
cleared on a successful return from kernel_sendmsg().
(4) Cancel calls immediately if the opening sendmsg fails. The layer
above can arrange retransmission or rotate to another server.
Fixes: 248f219cb8 ("rxrpc: Rewrite the data and ack handling code")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
icmp6_send() function is expensive on systems with a large number of
interfaces. Every time it’s called, it has to verify that the source
address does not correspond to an existing anycast address by looping
through every device and every anycast address on the device. This can
result in significant delays for a CPU when there are a large number of
neighbors and ND timers are frequently timing out and calling
neigh_invalidate().
Add anycast addresses to a global hashtable to allow quick searching for
matching anycast addresses. This is based on inet6_addr_lst in addrconf.c.
Signed-off-by: Jeff Barnhill <0xeffeff@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove kernel-doc warning:
net/core/skbuff.c:4953: warning: Function parameter or member 'skb' not described in 'skb_gso_size_check'
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull AFS updates from Al Viro:
"AFS series, with some iov_iter bits included"
* 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
missing bits of "iov_iter: Separate type from direction and use accessor functions"
afs: Probe multiple fileservers simultaneously
afs: Fix callback handling
afs: Eliminate the address pointer from the address list cursor
afs: Allow dumping of server cursor on operation failure
afs: Implement YFS support in the fs client
afs: Expand data structure fields to support YFS
afs: Get the target vnode in afs_rmdir() and get a callback on it
afs: Calc callback expiry in op reply delivery
afs: Fix FS.FetchStatus delivery from updating wrong vnode
afs: Implement the YFS cache manager service
afs: Remove callback details from afs_callback_break struct
afs: Commit the status on a new file/dir/symlink
afs: Increase to 64-bit volume ID and 96-bit vnode ID for YFS
afs: Don't invoke the server to read data beyond EOF
afs: Add a couple of tracepoints to log I/O errors
afs: Handle EIO from delivery function
afs: Fix TTL on VL server and address lists
afs: Implement VL server rotation
afs: Improve FS server rotation error handling
...
sunrpc patches from nfs tree conflict with calling conventions change done
in iov_iter work. Trivial fixup...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Most callers of pskb_trim_rcsum() simply drop the skb when
it fails, however, ip_check_defrag() still continues to pass
the skb up to stack. This is suspicious.
In ip_check_defrag(), after we learn the skb is an IP fragment,
passing the skb to callers makes no sense, because callers expect
fragments are defrag'ed on success. So, dropping the skb when we
can't defrag it is reasonable.
Note, prior to commit 88078d98d1, this is not a big problem as
checksum will be fixed up anyway. After it, the checksum is not
correct on failure.
Found this during code review.
Fixes: 88078d98d1 ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends")
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The seq_send & seq_send64 fields in struct krb5_ctx are used as
atomically incrementing counters. This is implemented using cmpxchg() &
cmpxchg64() to implement what amount to custom versions of
atomic_fetch_inc() & atomic64_fetch_inc().
Besides the duplication, using cmpxchg64() has another major drawback in
that some 32 bit architectures don't provide it. As such commit
571ed1fd23 ("SUNRPC: Replace krb5_seq_lock with a lockless scheme")
resulted in build failures for some architectures.
Change seq_send to be an atomic_t and seq_send64 to be an atomic64_t,
then use atomic(64)_* functions to manipulate the values. The atomic64_t
type & associated functions are provided even on architectures which
lack real 64 bit atomic memory access via CONFIG_GENERIC_ATOMIC64 which
uses spinlocks to serialize access. This fixes the build failures for
architectures lacking cmpxchg64().
A potential alternative that was raised would be to provide cmpxchg64()
on the 32 bit architectures that currently lack it, using spinlocks.
However this would provide a version of cmpxchg64() with semantics a
little different to the implementations on architectures with real 64
bit atomics - the spinlock-based implementation would only work if all
access to the memory used with cmpxchg64() is *always* performed using
cmpxchg64(). That is not currently a requirement for users of
cmpxchg64(), and making it one seems questionable. As such avoiding
cmpxchg64() outside of architecture-specific code seems best,
particularly in cases where atomic64_t seems like a better fit anyway.
The CONFIG_GENERIC_ATOMIC64 implementation of atomic64_* functions will
use spinlocks & so faces the same issue, but with the key difference
that the memory backing an atomic64_t ought to always be accessed via
the atomic64_* functions anyway making the issue moot.
Signed-off-by: Paul Burton <paul.burton@mips.com>
Fixes: 571ed1fd23 ("SUNRPC: Replace krb5_seq_lock with a lockless scheme")
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: J. Bruce Fields <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: linux-nfs@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Pull networking fixes from David Miller:
1) BPF verifier fixes from Daniel Borkmann.
2) HNS driver fixes from Huazhong Tan.
3) FDB only works for ethernet devices, reject attempts to install FDB
rules for others. From Ido Schimmel.
4) Fix spectre V1 in vhost, from Jason Wang.
5) Don't pass on-stack object to irq_set_affinity_hint() in mvpp2
driver, from Marc Zyngier.
6) Fix mlx5e checksum handling when RXFCS is enabled, from Eric
Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (49 commits)
openvswitch: Fix push/pop ethernet validation
net: stmmac: Fix stmmac_mdio_reset() when building stmmac as modules
bpf: test make sure to run unpriv test cases in test_verifier
bpf: add various test cases to test_verifier
bpf: don't set id on after map lookup with ptr_to_map_val return
bpf: fix partial copy of map_ptr when dst is scalar
libbpf: Fix compile error in libbpf_attach_type_by_name
kselftests/bpf: use ping6 as the default ipv6 ping binary if it exists
selftests: mlxsw: qos_mc_aware: Add a test for UC awareness
selftests: mlxsw: qos_mc_aware: Tweak for min shaper
mlxsw: spectrum: Set minimum shaper on MC TCs
mlxsw: reg: QEEC: Add minimum shaper fields
net: hns3: bugfix for rtnl_lock's range in the hclgevf_reset()
net: hns3: bugfix for rtnl_lock's range in the hclge_reset()
net: hns3: bugfix for handling mailbox while the command queue reinitialized
net: hns3: fix incorrect return value/type of some functions
net: hns3: bugfix for hclge_mdio_write and hclge_mdio_read
net: hns3: bugfix for is_valid_csq_clean_head()
net: hns3: remove unnecessary queue reset in the hns3_uninit_all_ring()
net: hns3: bugfix for the initialization of command queue's spin lock
...
Now that in_compat_syscall() is consistent on all architectures and does
not longer report true on native i686, the workarounds (ifdeffery and
helpers) can be removed.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-efi@vger.kernel.org
Cc: netdev@vger.kernel.org
Link: https://lkml.kernel.org/r/20181012134253.23266-3-dima@arista.com
When there are both pop and push ethernet header actions among the
actions to be applied to a packet, an unexpected EINVAL (Invalid
argument) error is obtained. This is due to mac_proto not being reset
correctly when those actions are validated.
Reported-at:
https://mail.openvswitch.org/pipermail/ovs-discuss/2018-October/047554.html
Fixes: 91820da6ae ("openvswitch: add Ethernet push and pop actions")
Signed-off-by: Jaime Caamaño Ruiz <jcaamano@suse.com>
Tested-by: Greg Rose <gvrose8192@gmail.com>
Reviewed-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates 2018-10-31
This series contains a various collection of fixes.
Miroslav Lichvar from Red Hat or should I say IBM now? Updates the PHC
timecounter interval for igb so that it gets updated at least once
every 550 seconds.
Ngai-Mint provides a fix for fm10k to prevent a soft lockup or system
crash by adding a new condition to determine if the SM mailbox is in the
correct state before proceeding.
Jake provides several fm10k fixes, first one marks complier aborts as
non-fatal since on some platforms trigger machine check errors when the
compile aborts. Added missing device ids to the in-kernel driver. Due
to the recent fixes, bumped the driver version.
I (Jeff Kirsher) fixed a XFRM_ALGO dependency for both ixgbe and
ixgbevf. This fix was based on the original work from Arnd Bergmann,
which only fixed ixgbe.
Mitch provides a fix for i40e/avf to update the status codes, which
resolves an issue between a mis-match between i40e and the iavf driver,
which also supports the ice LAN driver.
Radoslaw fixes the ixgbe where the driver is logging a message about
spoofed packets detected when the VF is re-started with a different MAC
address.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_set_create() and ip_set_net_init() attempt to allocate physically
contiguous memory for ip_set_list. If memory is fragmented, the
allocations could easily fail:
vzctl: page allocation failure: order:7, mode:0xc0d0
Call Trace:
dump_stack+0x19/0x1b
warn_alloc_failed+0x110/0x180
__alloc_pages_nodemask+0x7bf/0xc60
alloc_pages_current+0x98/0x110
kmalloc_order+0x18/0x40
kmalloc_order_trace+0x26/0xa0
__kmalloc+0x279/0x290
ip_set_net_init+0x4b/0x90 [ip_set]
ops_init+0x3b/0xb0
setup_net+0xbb/0x170
copy_net_ns+0xf1/0x1c0
create_new_namespaces+0xf9/0x180
copy_namespaces+0x8e/0xd0
copy_process+0xb61/0x1a00
do_fork+0x91/0x320
Use kvcalloc() to fallback to 0-order allocations if high order
page isn't available.
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Allow /0 as advertised for hash:net,port,net sets.
For "hash:net,port,net", ipset(8) says that "either subnet
is permitted to be a /0 should you wish to match port
between all destinations."
Make that statement true.
Before:
# ipset create cidrzero hash:net,port,net
# ipset add cidrzero 0.0.0.0/0,12345,0.0.0.0/0
ipset v6.34: The value of the CIDR parameter of the IP address is invalid
# ipset create cidrzero6 hash:net,port,net family inet6
# ipset add cidrzero6 ::/0,12345,::/0
ipset v6.34: The value of the CIDR parameter of the IP address is invalid
After:
# ipset create cidrzero hash:net,port,net
# ipset add cidrzero 0.0.0.0/0,12345,0.0.0.0/0
# ipset test cidrzero 192.168.205.129,12345,172.16.205.129
192.168.205.129,tcp:12345,172.16.205.129 is in set cidrzero.
# ipset create cidrzero6 hash:net,port,net family inet6
# ipset add cidrzero6 ::/0,12345,::/0
# ipset test cidrzero6 fe80::1,12345,ff00::1
fe80::1,tcp:12345,ff00::1 is in set cidrzero6.
See also:
https://bugzilla.kernel.org/show_bug.cgi?id=200897df7ff6efb0
Signed-off-by: Eric Westbrook <linux@westbrook.io>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Commit 45040978c8 ("netfilter: ipset: Fix set:list type crash
when flush/dump set in parallel") postponed decreasing set
reference counters to the RCU callback.
An 'ipset del' command can terminate before the RCU grace period
is elapsed, and if sets are listed before then, the reference
counter shown in userspace will be wrong:
# ipset create h hash:ip; ipset create l list:set; ipset add l
# ipset del l h; ipset list h
Name: h
Type: hash:ip
Revision: 4
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 88
References: 1
Number of entries: 0
Members:
# sleep 1; ipset list h
Name: h
Type: hash:ip
Revision: 4
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 88
References: 0
Number of entries: 0
Members:
Fix this by making the reference count update synchronous again.
As a result, when sets are listed, ip_set_name_byindex() might
now fetch a set whose reference count is already zero. Instead
of relying on the reference count to protect against concurrent
set renaming, grab ip_set_ref_lock as reader and copy the name,
while holding the same lock in ip_set_rename() as writer
instead.
Reported-by: Li Shuang <shuali@redhat.com>
Fixes: 45040978c8 ("netfilter: ipset: Fix set:list type crash when flush/dump set in parallel")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
- a series that fixes some old memory allocation issues in libceph
(myself). We no longer allocate memory in places where allocation
failures cannot be handled and BUG when the allocation fails.
- support for copy_file_range() syscall (Luis Henriques). If size and
alignment conditions are met, it leverages RADOS copy-from operation.
Otherwise, a local copy is performed.
- a patch that reduces memory requirement of ceph_sync_read() from the
size of the entire read to the size of one object (Zheng Yan).
- fallocate() syscall is now restricted to FALLOC_FL_PUNCH_HOLE (Luis
Henriques)
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAlvZ6AcTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi8H+B/9V/QB1BX5Q2DvkS3mcLNI2NphrppaD
VBuviwoIzaBm1paCrx40J/pCtsK1Fybl5dBAh1W0SDxEGR8JUA8GJw+oemtOS6pZ
DwjOF9S7uhzf5M3nQ9SvAbIudBISMZQRi22Y8fWs3k+yaECIz1J/pe7RiKo/GBAB
NnlbrZ1AYSB02chchVCSmWTApeIRp9JXnaM9xLMJWGVLL/vONjt3ltJ/w9haGYz8
FPFLPFeWobWqFElnOUomxU8Cv84DgPtH8si0UAn16jveractpFJWO4X6LDs/ZYDk
/MccfsB3EK9BCJdLJMoI0/lXxE33z3/MehmJDs9xGSX/N4N7UTF8Ve1b
=U91e
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.20-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"The highlights are:
- a series that fixes some old memory allocation issues in libceph
(myself). We no longer allocate memory in places where allocation
failures cannot be handled and BUG when the allocation fails.
- support for copy_file_range() syscall (Luis Henriques). If size and
alignment conditions are met, it leverages RADOS copy-from
operation. Otherwise, a local copy is performed.
- a patch that reduces memory requirement of ceph_sync_read() from
the size of the entire read to the size of one object (Zheng Yan).
- fallocate() syscall is now restricted to FALLOC_FL_PUNCH_HOLE (Luis
Henriques)"
* tag 'ceph-for-4.20-rc1' of git://github.com/ceph/ceph-client: (25 commits)
ceph: new mount option to disable usage of copy-from op
ceph: support copy_file_range file operation
libceph: support the RADOS copy-from operation
ceph: add non-blocking parameter to ceph_try_get_caps()
libceph: check reply num_data_items in setup_request_data()
libceph: preallocate message data items
libceph, rbd, ceph: move ceph_osdc_alloc_messages() calls
libceph: introduce alloc_watch_request()
libceph: assign cookies in linger_submit()
libceph: enable fallback to ceph_msg_new() in ceph_msgpool_get()
ceph: num_ops is off by one in ceph_aio_retry_work()
libceph: no need to call osd_req_opcode_valid() in osd_req_encode_op()
ceph: set timeout conditionally in __cap_delay_requeue
libceph: don't consume a ref on pagelist in ceph_msg_data_add_pagelist()
libceph: introduce ceph_pagelist_alloc()
libceph: osd_req_op_cls_init() doesn't need to take opcode
libceph: bump CEPH_MSG_MAX_DATA_LEN
ceph: only allow punch hole mode in fallocate
ceph: refactor ceph_sync_read()
ceph: check if LOOKUPNAME request was aborted when filling trace
...
Based on the original work from Arnd Bergmann.
When XFRM_ALGO is not enabled, the new ixgbe IPsec code produces a
link error:
drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.o: In function `ixgbe_ipsec_vf_add_sa':
ixgbe_ipsec.c:(.text+0x1266): undefined reference to `xfrm_aead_get_byname'
Simply selecting XFRM_ALGO from here causes circular dependencies, so
to fix it, we probably want this slightly more complex solution that is
similar to what other drivers with XFRM offload do:
A separate Kconfig symbol now controls whether we include the IPsec
offload code. To keep the old behavior, this is left as 'default y'. The
dependency in XFRM_OFFLOAD still causes a circular dependency but is
not actually needed because this symbol is not user visible, so removing
that dependency on top makes it all work.
CC: Arnd Bergmann <arnd@arndb.de>
CC: Shannon Nelson <shannon.nelson@oracle.com>
Fixes: eda0333ac2 ("ixgbe: add VF IPsec management")
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Merge more updates from Andrew Morton:
- the rest of MM
- lib/bitmap updates
- hfs updates
- fatfs updates
- various other misc things
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits)
mm/gup.c: fix __get_user_pages_fast() comment
mm: Fix warning in insert_pfn()
memory-hotplug.rst: add some details about locking internals
powerpc/powernv: hold device_hotplug_lock when calling memtrace_offline_pages()
powerpc/powernv: hold device_hotplug_lock when calling device_online()
mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock
mm/memory_hotplug: make add_memory() take the device_hotplug_lock
mm/memory_hotplug: make remove_memory() take the device_hotplug_lock
mm/memblock.c: warn if zero alignment was requested
memblock: stop using implicit alignment to SMP_CACHE_BYTES
docs/boot-time-mm: remove bootmem documentation
mm: remove include/linux/bootmem.h
memblock: replace BOOTMEM_ALLOC_* with MEMBLOCK variants
mm: remove nobootmem
memblock: rename __free_pages_bootmem to memblock_free_pages
memblock: rename free_all_bootmem to memblock_free_all
memblock: replace free_bootmem_late with memblock_free_late
memblock: replace free_bootmem{_node} with memblock_free
mm: nobootmem: remove bootmem allocation APIs
memblock: replace alloc_bootmem with memblock_alloc
...
Move remaining definitions and declarations from include/linux/bootmem.h
into include/linux/memblock.h and remove the redundant header.
The includes were replaced with the semantic patch below and then
semi-automated removal of duplicated '#include <linux/memblock.h>
@@
@@
- #include <linux/bootmem.h>
+ #include <linux/memblock.h>
[sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
[sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
[sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We return 0 in the case of a nonblocking socket that has no data
available. However, this is incorrect and may confuse applications.
After this patch we do the correct thing and return the error
EAGAIN.
Quoting return codes from recvmsg manpage,
EAGAIN or EWOULDBLOCK
The socket is marked nonblocking and the receive operation would
block, or a receive timeout had been set and the timeout expired
before data was received.
Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>