License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.
This patch is based on work done by Thomas Gleixner, Kate Stewart, and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to licenses
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX license identifier should be applied
to a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files, created by Philippe Ombredanne. Philippe prepared the
base worksheet and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
should be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
The criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when neither scanner could find any license traces, the file was
considered to have no license information in it, and the top-level
COPYING file license applied.
For non-*/uapi/* files, the summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If the file was under a */uapi/* path, it was "GPL-2.0 WITH
Linux-syscall-note"; otherwise it was "GPL-2.0". The results were:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, the Linux-syscall-note was added if any GPL
family license was found in the file, or if it had no licensing in
it (per the prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review of the spreadsheet was
done by Kate, Philippe and Thomas to determine the SPDX license
identifiers to apply to the source files, with confirmation in some
cases by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with the SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to
have copy/paste license identifier errors; they have been fixed to
reflect the correct identifier.
Additionally, Philippe spent 10 hours doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version earlier this week, with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types). Finally, Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 17:07:57 +03:00
|
|
|
/* SPDX-License-Identifier: GPL-2.0 */
|
2007-09-12 13:50:50 +04:00
|
|
|
/*
|
|
|
|
* Operations on the network namespace
|
|
|
|
*/
|
|
|
|
#ifndef __NET_NET_NAMESPACE_H
|
|
|
|
#define __NET_NET_NAMESPACE_H
|
|
|
|
|
2011-07-27 03:09:06 +04:00
|
|
|
#include <linux/atomic.h>
|
2017-06-30 13:08:08 +03:00
|
|
|
#include <linux/refcount.h>
|
2007-09-12 13:50:50 +04:00
|
|
|
#include <linux/workqueue.h>
|
|
|
|
#include <linux/list.h>
|
2011-05-27 00:40:37 +04:00
|
|
|
#include <linux/sysctl.h>
|
2018-07-21 00:56:53 +03:00
|
|
|
#include <linux/uidgid.h>
|
2007-09-12 13:50:50 +04:00
|
|
|
|
ipv4, fib: pass LOOPBACK_IFINDEX instead of 0 to flowi4_iif
As suggested by Julian:
Simply, flowi4_iif must not contain 0, it does not
look logical to ignore all ip rules with specified iif.
because in fib_rule_match() we do:
if (rule->iifindex && (rule->iifindex != fl->flowi_iif))
goto out;
flowi4_iif should be LOOPBACK_IFINDEX by default.
We need to move LOOPBACK_IFINDEX to include/net/flow.h:
1) It is mostly used by flowi_iif
2) It fixes the following compile error when later patches use it
in flow.h:
In file included from include/linux/netfilter.h:277:0,
from include/net/netns/netfilter.h:5,
from include/net/net_namespace.h:21,
from include/linux/netdevice.h:43,
from include/linux/icmpv6.h:12,
from include/linux/ipv6.h:61,
from include/net/ipv6.h:16,
from include/linux/sunrpc/clnt.h:27,
from include/linux/nfs_fs.h:30,
from init/do_mounts.c:32:
include/net/flow.h: In function ‘flowi4_init_output’:
include/net/flow.h:84:32: error: ‘LOOPBACK_IFINDEX’ undeclared (first use in this function)
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-16 03:25:34 +04:00
|
|
|
#include <net/flow.h>
|
2008-04-01 06:41:14 +04:00
|
|
|
#include <net/netns/core.h>
|
2008-07-18 15:01:24 +04:00
|
|
|
#include <net/netns/mib.h>
|
2007-12-11 15:19:17 +03:00
|
|
|
#include <net/netns/unix.h>
|
2007-12-11 15:19:54 +03:00
|
|
|
#include <net/netns/packet.h>
|
2007-12-17 00:29:36 +03:00
|
|
|
#include <net/netns/ipv4.h>
|
2008-01-10 13:49:06 +03:00
|
|
|
#include <net/netns/ipv6.h>
|
2019-05-25 00:43:04 +03:00
|
|
|
#include <net/netns/nexthop.h>
|
2014-02-28 10:32:49 +04:00
|
|
|
#include <net/netns/ieee802154_6lowpan.h>
|
2012-08-06 12:42:04 +04:00
|
|
|
#include <net/netns/sctp.h>
|
2013-03-25 03:50:39 +04:00
|
|
|
#include <net/netns/netfilter.h>
|
2008-01-31 15:02:13 +03:00
|
|
|
#include <net/netns/x_tables.h>
|
2008-10-08 13:35:02 +04:00
|
|
|
#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
|
|
|
|
#include <net/netns/conntrack.h>
|
|
|
|
#endif
|
2013-10-11 01:28:33 +04:00
|
|
|
#include <net/netns/nftables.h>
|
2008-11-26 04:14:31 +03:00
|
|
|
#include <net/netns/xfrm.h>
|
2015-03-04 04:10:47 +03:00
|
|
|
#include <net/netns/mpls.h>
|
2017-02-21 14:19:47 +03:00
|
|
|
#include <net/netns/can.h>
|
2019-01-24 21:59:37 +03:00
|
|
|
#include <net/netns/xdp.h>
|
2020-05-31 11:28:36 +03:00
|
|
|
#include <net/netns/bpf.h>
|
2014-11-01 05:56:04 +03:00
|
|
|
#include <linux/ns_common.h>
|
2015-06-17 18:28:25 +03:00
|
|
|
#include <linux/idr.h>
|
|
|
|
#include <linux/skbuff.h>
|
2019-09-30 11:15:10 +03:00
|
|
|
#include <linux/notifier.h>
|
2007-12-11 15:19:17 +03:00
|
|
|
|
2012-06-14 13:31:10 +04:00
|
|
|
struct user_namespace;
|
2007-09-12 14:01:34 +04:00
|
|
|
struct proc_dir_entry;
|
2007-09-27 09:10:56 +04:00
|
|
|
struct net_device;
|
2007-11-20 09:26:51 +03:00
|
|
|
struct sock;
|
2007-12-01 15:51:01 +03:00
|
|
|
struct ctl_table_header;
|
2008-04-15 11:36:08 +04:00
|
|
|
struct net_generic;
|
2018-03-19 15:17:30 +03:00
|
|
|
struct uevent_sock;
|
2011-03-04 13:18:07 +03:00
|
|
|
struct netns_ipvs;
|
2018-09-14 17:46:18 +03:00
|
|
|
struct bpf_prog;
|
2007-12-01 15:51:01 +03:00
|
|
|
|
2009-10-24 17:13:17 +04:00
|
|
|
|
|
|
|
#define NETDEV_HASHBITS 8
|
|
|
|
#define NETDEV_HASHENTRIES (1 << NETDEV_HASHBITS)
|
|
|
|
|
2007-09-12 13:50:50 +04:00
|
|
|
struct net {
|
2019-10-19 01:20:05 +03:00
|
|
|
/* First cache line can be often dirtied.
|
|
|
|
* Do not place here read-mostly fields.
|
|
|
|
*/
|
2019-08-21 14:29:29 +03:00
|
|
|
refcount_t passive; /* To decide when the network
|
2011-06-09 05:13:01 +04:00
|
|
|
* namespace should be freed.
|
|
|
|
*/
|
2010-10-14 09:56:18 +04:00
|
|
|
spinlock_t rules_mod_lock;
|
|
|
|
|
2019-10-19 01:20:05 +03:00
|
|
|
unsigned int dev_unreg_count;
|
|
|
|
|
|
|
|
unsigned int dev_base_seq; /* protected by rtnl_mutex */
|
|
|
|
int ifindex;
|
|
|
|
|
|
|
|
spinlock_t nsid_lock;
|
|
|
|
atomic_t fnhe_genid;
|
2015-03-12 04:53:14 +03:00
|
|
|
|
2007-09-12 13:50:50 +04:00
|
|
|
struct list_head list; /* list of network namespaces */
|
2018-02-19 12:58:38 +03:00
|
|
|
struct list_head exit_list; /* To linked to call pernet exit
|
2018-03-27 18:02:23 +03:00
|
|
|
* methods on dead net (
|
|
|
|
* pernet_ops_rwsem read locked),
|
|
|
|
* or to unregister pernet ops
|
|
|
|
* (pernet_ops_rwsem write locked).
|
2018-02-19 12:58:38 +03:00
|
|
|
*/
|
2018-02-19 12:58:45 +03:00
|
|
|
struct llist_node cleanup_list; /* namespaces on death row */
|
|
|
|
|
2019-06-26 23:02:33 +03:00
|
|
|
#ifdef CONFIG_KEYS
|
|
|
|
struct key_tag *key_domain; /* Key domain of operation tag */
|
|
|
|
#endif
|
2012-06-14 13:31:10 +04:00
|
|
|
struct user_namespace *user_ns; /* Owning user namespace */
|
2016-08-08 22:33:23 +03:00
|
|
|
struct ucounts *ucounts;
|
2015-01-15 17:11:15 +03:00
|
|
|
struct idr netns_ids;
|
2012-06-14 13:31:10 +04:00
|
|
|
|
2014-11-01 05:56:04 +03:00
|
|
|
struct ns_common ns;
|
2011-06-15 21:21:48 +04:00
|
|
|
|
2019-10-19 01:20:05 +03:00
|
|
|
struct list_head dev_base_head;
|
2007-09-12 14:01:34 +04:00
|
|
|
struct proc_dir_entry *proc_net;
|
|
|
|
struct proc_dir_entry *proc_net_stat;
|
2007-09-17 22:56:21 +04:00
|
|
|
|
2008-07-15 05:22:20 +04:00
|
|
|
#ifdef CONFIG_SYSCTL
|
|
|
|
struct ctl_table_set sysctls;
|
|
|
|
#endif
|
2007-11-30 15:55:42 +03:00
|
|
|
|
2010-10-14 09:56:18 +04:00
|
|
|
struct sock *rtnl; /* rtnetlink socket */
|
|
|
|
struct sock *genl_sock;
|
2007-09-27 09:10:56 +04:00
|
|
|
|
2018-03-19 15:17:30 +03:00
|
|
|
struct uevent_sock *uevent_sock; /* uevent socket */
|
|
|
|
|
2007-09-17 22:56:21 +04:00
|
|
|
struct hlist_head *dev_name_head;
|
|
|
|
struct hlist_head *dev_index_head;
|
2019-09-30 11:15:10 +03:00
|
|
|
struct raw_notifier_head netdev_chain;
|
|
|
|
|
2019-10-19 01:20:05 +03:00
|
|
|
/* Note that @hash_mix can be read millions times per second,
|
|
|
|
* it is critical that it is on a read_mostly cache line.
|
|
|
|
*/
|
|
|
|
u32 hash_mix;
|
|
|
|
|
|
|
|
struct net_device *loopback_dev; /* The loopback */
|
2007-11-20 09:26:51 +03:00
|
|
|
|
2008-01-10 14:20:28 +03:00
|
|
|
/* core fib_rules */
|
|
|
|
struct list_head rules_ops;
|
|
|
|
|
2008-04-01 06:41:14 +04:00
|
|
|
struct netns_core core;
|
2008-07-18 15:01:24 +04:00
|
|
|
struct netns_mib mib;
|
2007-12-11 15:19:54 +03:00
|
|
|
struct netns_packet packet;
|
2007-12-11 15:19:17 +03:00
|
|
|
struct netns_unix unx;
|
2019-05-25 00:43:04 +03:00
|
|
|
struct netns_nexthop nexthop;
|
2007-12-17 00:29:36 +03:00
|
|
|
struct netns_ipv4 ipv4;
|
2011-12-10 13:48:31 +04:00
|
|
|
#if IS_ENABLED(CONFIG_IPV6)
|
2008-01-10 13:49:06 +03:00
|
|
|
struct netns_ipv6 ipv6;
|
|
|
|
#endif
|
2014-02-28 10:32:49 +04:00
|
|
|
#if IS_ENABLED(CONFIG_IEEE802154_6LOWPAN)
|
|
|
|
struct netns_ieee802154_lowpan ieee802154_lowpan;
|
|
|
|
#endif
|
2012-08-06 12:42:04 +04:00
|
|
|
#if defined(CONFIG_IP_SCTP) || defined(CONFIG_IP_SCTP_MODULE)
|
|
|
|
struct netns_sctp sctp;
|
|
|
|
#endif
|
2008-01-31 15:02:13 +03:00
|
|
|
#ifdef CONFIG_NETFILTER
|
2013-03-25 03:50:39 +04:00
|
|
|
struct netns_nf nf;
|
2008-01-31 15:02:13 +03:00
|
|
|
struct netns_xt xt;
|
2008-10-08 13:35:02 +04:00
|
|
|
#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
|
|
|
|
struct netns_ct ct;
|
2012-09-18 20:50:08 +04:00
|
|
|
#endif
|
2013-10-11 01:28:33 +04:00
|
|
|
#if defined(CONFIG_NF_TABLES) || defined(CONFIG_NF_TABLES_MODULE)
|
|
|
|
struct netns_nftables nft;
|
|
|
|
#endif
|
2008-11-26 04:14:31 +03:00
|
|
|
#endif
|
2009-09-30 01:27:28 +04:00
|
|
|
#ifdef CONFIG_WEXT_CORE
|
2009-06-24 05:34:48 +04:00
|
|
|
struct sk_buff_head wext_nlevents;
|
2008-01-31 15:02:13 +03:00
|
|
|
#endif
|
2010-10-25 07:20:11 +04:00
|
|
|
struct net_generic __rcu *gen;
|
2010-10-14 09:56:18 +04:00
|
|
|
|
2020-05-31 11:28:36 +03:00
|
|
|
/* Used to store attached BPF programs */
|
|
|
|
struct netns_bpf bpf;
|
2018-09-14 17:46:18 +03:00
|
|
|
|
2010-10-14 09:56:18 +04:00
|
|
|
/* Note : following structs are cache line aligned */
|
|
|
|
#ifdef CONFIG_XFRM
|
|
|
|
struct netns_xfrm xfrm;
|
|
|
|
#endif
|
bpf: Add netns cookie and enable it for bpf cgroup hooks
In Cilium we're mainly using BPF cgroup hooks today in order to implement
kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*),
ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic
between Cilium managed nodes. While this works in its current shape and avoids
packet-level NAT for inter Cilium managed node traffic, there is one major
limitation we're facing today, that is, lack of netns awareness.
In Kubernetes, the concept of Pods (which hold one or multiple containers)
has been built around network namespaces, so while we can use the global scope
of attaching to root BPF cgroup hooks also to our advantage (e.g. for exposing
NodePort ports on loopback addresses), we also have the need to differentiate
between the initial network namespace and non-initial ones. For example,
ExternalIP services mandate that non-local service IPs are not to be
translated from the host (initial) network namespace. Right now, we have an ugly
work-around in place where non-local service IPs for ExternalIP services are
not xlated from connect() and friends BPF hooks but instead via less efficient
packet-level NAT on the veth tc ingress hook for Pod traffic.
On top of determining whether we're in initial or non-initial network namespace
we also have a need for a socket-cookie like mechanism for network namespaces
scope. Socket cookies have the nice property that they can be combined as part
of the key structure e.g. for BPF LRU maps without having to worry that the
cookie could be recycled. We are planning to use this for our sessionAffinity
implementation for services. Therefore, add a new bpf_get_netns_cookie() helper
which would resolve both use cases at once: bpf_get_netns_cookie(NULL) would
provide the cookie for the initial network namespace while passing the context
instead of NULL would provide the cookie from the application's network namespace.
We're using a hole, so no size increase; the assignment happens only once.
Therefore this allows for a comparison on initial namespace as well as regular
cookie usage as we have today with socket cookies. We could later enable
this helper for other program types as well, as the need arises.
(*) Both externalTrafficPolicy={Local|Cluster} types
[0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net
2020-03-27 18:58:52 +03:00
|
|
|
|
2021-02-10 17:41:44 +03:00
|
|
|
u64 net_cookie; /* written once */
|
|
|
|
|
2013-06-26 12:40:06 +04:00
|
|
|
#if IS_ENABLED(CONFIG_IP_VS)
|
2011-01-03 16:44:42 +03:00
|
|
|
struct netns_ipvs *ipvs;
|
2015-03-04 04:10:47 +03:00
|
|
|
#endif
|
|
|
|
#if IS_ENABLED(CONFIG_MPLS)
|
|
|
|
struct netns_mpls mpls;
|
2017-02-21 14:19:47 +03:00
|
|
|
#endif
|
|
|
|
#if IS_ENABLED(CONFIG_CAN)
|
|
|
|
struct netns_can can;
|
2019-01-24 21:59:37 +03:00
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_XDP_SOCKETS
|
|
|
|
struct netns_xdp xdp;
|
2019-07-09 14:11:24 +03:00
|
|
|
#endif
|
|
|
|
#if IS_ENABLED(CONFIG_CRYPTO_USER)
|
|
|
|
struct sock *crypto_nlsk;
|
2013-06-26 12:40:06 +04:00
|
|
|
#endif
|
2012-07-16 08:28:49 +04:00
|
|
|
struct sock *diag_nlsk;
|
2016-10-28 11:22:25 +03:00
|
|
|
} __randomize_layout;
|
2007-09-12 13:50:50 +04:00
|
|
|
|
2008-04-02 11:10:28 +04:00
|
|
|
#include <linux/seq_file_net.h>
|
|
|
|
|
2007-09-13 11:16:29 +04:00
|
|
|
/* Init's network namespace */
|
2007-09-12 13:50:50 +04:00
|
|
|
extern struct net init_net;
|
2008-04-04 00:04:33 +04:00
|
|
|
|
2012-06-14 13:16:42 +04:00
|
|
|
#ifdef CONFIG_NET_NS
|
2013-09-21 21:22:48 +04:00
|
|
|
struct net *copy_net_ns(unsigned long flags, struct user_namespace *user_ns,
|
|
|
|
struct net *old_net);
|
2008-04-02 11:09:29 +04:00
|
|
|
|
2018-07-21 00:56:53 +03:00
|
|
|
void net_ns_get_ownership(const struct net *net, kuid_t *uid, kgid_t *gid);
|
|
|
|
|
2017-05-30 12:38:12 +03:00
|
|
|
void net_ns_barrier(void);
|
2012-06-14 13:16:42 +04:00
|
|
|
#else /* CONFIG_NET_NS */
|
|
|
|
#include <linux/sched.h>
|
|
|
|
#include <linux/nsproxy.h>
|
2012-06-14 13:31:10 +04:00
|
|
|
static inline struct net *copy_net_ns(unsigned long flags,
|
|
|
|
struct user_namespace *user_ns, struct net *old_net)
|
2007-09-27 09:04:26 +04:00
|
|
|
{
|
2012-06-14 13:16:42 +04:00
|
|
|
if (flags & CLONE_NEWNET)
|
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
return old_net;
|
2007-09-27 09:04:26 +04:00
|
|
|
}
|
2017-05-30 12:38:12 +03:00
|
|
|
|
2018-07-21 00:56:53 +03:00
|
|
|
static inline void net_ns_get_ownership(const struct net *net,
|
|
|
|
kuid_t *uid, kgid_t *gid)
|
|
|
|
{
|
|
|
|
*uid = GLOBAL_ROOT_UID;
|
|
|
|
*gid = GLOBAL_ROOT_GID;
|
|
|
|
}
|
|
|
|
|
2017-05-30 12:38:12 +03:00
|
|
|
static inline void net_ns_barrier(void) {}
|
2012-06-14 13:16:42 +04:00
|
|
|
#endif /* CONFIG_NET_NS */
|
2008-04-02 11:09:29 +04:00
|
|
|
|
|
|
|
|
|
|
|
extern struct list_head net_namespace_list;
|
2007-09-27 09:04:26 +04:00
|
|
|
|
2013-09-21 21:22:48 +04:00
|
|
|
struct net *get_net_ns_by_pid(pid_t pid);
|
2016-11-18 12:41:46 +03:00
|
|
|
struct net *get_net_ns_by_fd(int fd);
|
2009-07-10 13:51:35 +04:00
|
|
|
|
2014-02-09 20:59:14 +04:00
|
|
|
#ifdef CONFIG_SYSCTL
|
|
|
|
void ipx_register_sysctl(void);
|
|
|
|
void ipx_unregister_sysctl(void);
|
|
|
|
#else
|
|
|
|
#define ipx_register_sysctl()
|
|
|
|
#define ipx_unregister_sysctl()
|
|
|
|
#endif
|
|
|
|
|
2007-11-01 10:43:49 +03:00
|
|
|
#ifdef CONFIG_NET_NS
|
2013-09-21 21:22:48 +04:00
|
|
|
void __put_net(struct net *net);
|
2007-09-12 13:50:50 +04:00
|
|
|
|
|
|
|
static inline struct net *get_net(struct net *net)
|
|
|
|
{
|
2020-08-19 15:06:36 +03:00
|
|
|
refcount_inc(&net->ns.count);
|
2007-09-12 13:50:50 +04:00
|
|
|
return net;
|
|
|
|
}
|
|
|
|
|
2007-09-13 11:18:57 +04:00
|
|
|
static inline struct net *maybe_get_net(struct net *net)
|
|
|
|
{
|
|
|
|
/* Used when we know struct net exists but we
|
|
|
|
* aren't guaranteed a previous reference count
|
|
|
|
* exists. If the reference count is zero this
|
|
|
|
* function fails and returns NULL.
|
|
|
|
*/
|
2020-08-19 15:06:36 +03:00
|
|
|
if (!refcount_inc_not_zero(&net->ns.count))
|
2007-09-13 11:18:57 +04:00
|
|
|
net = NULL;
|
|
|
|
return net;
|
|
|
|
}
|
|
|
|
|
2007-09-12 13:50:50 +04:00
|
|
|
static inline void put_net(struct net *net)
|
|
|
|
{
|
2020-08-19 15:06:36 +03:00
|
|
|
if (refcount_dec_and_test(&net->ns.count))
|
2007-09-12 13:50:50 +04:00
|
|
|
__put_net(net);
|
|
|
|
}
|
|
|
|
|
2008-03-25 21:57:35 +03:00
|
|
|
static inline
|
|
|
|
int net_eq(const struct net *net1, const struct net *net2)
|
|
|
|
{
|
|
|
|
return net1 == net2;
|
|
|
|
}
|
2011-06-09 05:13:01 +04:00
|
|
|
|
net: tcp: close sock if net namespace is exiting
When a tcp socket is closed, if it detects that its net namespace is
exiting, it closes immediately and does not wait for the FIN sequence.
For normal sockets, a reference is taken to their net namespace, so it will
never exit while the socket is open. However, kernel sockets do not take a
reference to their net namespace, so it may begin exiting while the kernel
socket is still open. In this case if the kernel socket is a tcp socket,
it will stay open trying to complete its close sequence. The sock's dst(s)
hold a reference to their interface, which are all transferred to the
namespace's loopback interface when the real interfaces are taken down.
When the namespace tries to take down its loopback interface, it hangs
waiting for all references to the loopback interface to release, which
results in messages like:
unregister_netdevice: waiting for lo to become free. Usage count = 1
These messages continue until the socket finally times out and closes.
Since the net namespace cleanup holds the net_mutex while calling its
registered pernet callbacks, any new net namespace initialization is
blocked until the current net namespace finishes exiting.
After this change, the tcp socket notices the exiting net namespace, and
closes immediately, releasing its dst(s) and their reference to the
loopback interface, which lets the net namespace continue exiting.
Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=97811
Signed-off-by: Dan Streetman <ddstreet@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-19 00:14:26 +03:00
|
|
|
static inline int check_net(const struct net *net)
|
|
|
|
{
|
2020-08-19 15:06:36 +03:00
|
|
|
return refcount_read(&net->ns.count) != 0;
|
|
|
|
}
|
|
|
|
|
2013-09-21 21:22:48 +04:00
|
|
|
void net_drop_ns(void *);
|
2011-06-09 05:13:01 +04:00
|
|
|
|
2007-11-01 10:43:49 +03:00
|
|
|
#else
|
2008-06-21 09:16:51 +04:00
|
|
|
|
2007-11-01 10:43:49 +03:00
|
|
|
static inline struct net *get_net(struct net *net)
|
|
|
|
{
|
|
|
|
return net;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void put_net(struct net *net)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2008-04-16 12:58:04 +04:00
|
|
|
static inline struct net *maybe_get_net(struct net *net)
|
|
|
|
{
|
|
|
|
return net;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline
|
|
|
|
int net_eq(const struct net *net1, const struct net *net2)
|
|
|
|
{
|
|
|
|
return 1;
|
|
|
|
}
|
2011-06-09 05:13:01 +04:00
|
|
|
|
|
|
|
static inline int check_net(const struct net *net)
|
|
|
|
{
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2011-06-09 05:13:01 +04:00
|
|
|
#define net_drop_ns NULL
|
2008-04-16 12:58:04 +04:00
|
|
|
#endif
typedef struct {
#ifdef CONFIG_NET_NS
	struct net *net;
#endif
} possible_net_t;

static inline void write_pnet(possible_net_t *pnet, struct net *net)
{
#ifdef CONFIG_NET_NS
	pnet->net = net;
#endif
}

static inline struct net *read_pnet(const possible_net_t *pnet)
{
#ifdef CONFIG_NET_NS
	return pnet->net;
#else
	return &init_net;
#endif
}
net: Introduce net_rwsem to protect net_namespace_list
rtnl_lock() is used everywhere, and contention is very high.
When someone wants to iterate over alive net namespaces,
there is no way to do that without taking the exclusive lock.
But the exclusive rtnl_lock() in such places is overkill,
and it just increases the contention. Yes, there is already
for_each_net_rcu() in the kernel, but it requires rcu_read_lock(),
so the iteration body can't sleep. Also, sometimes we really
need to prevent net_namespace_list from growing, and for_each_net_rcu()
does not fit there.
This patch introduces a new rw_semaphore, which will be used
instead of rtnl_mutex to protect net_namespace_list. It is
sleepable and allows non-exclusive iteration over the net
namespaces list. It lets us stop using rtnl_lock()
in several places (done in the next patches) and reduces
the time we hold rtnl_mutex. Here we just add the new lock;
the explanation of why rtnl_lock() can be removed in each place
comes in the next patches.
Fine-grained locks are generally better than one big lock,
so let's do that with net_namespace_list, while the situation
allows it.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-29 19:20:32 +03:00
/* Protected by net_rwsem */
#define for_each_net(VAR)				\
	list_for_each_entry(VAR, &net_namespace_list, list)

#define for_each_net_continue_reverse(VAR)		\
	list_for_each_entry_continue_reverse(VAR, &net_namespace_list, list)

#define for_each_net_rcu(VAR)				\
	list_for_each_entry_rcu(VAR, &net_namespace_list, list)
#ifdef CONFIG_NET_NS
#define __net_init
#define __net_exit
#define __net_initdata
#define __net_initconst
#else
#define __net_init	__init
#define __net_exit	__ref
#define __net_initdata	__initdata
#define __net_initconst	__initconst
#endif
netns: fix GFP flags in rtnl_net_notifyid()
In rtnl_net_notifyid(), we certainly can't pass a null GFP flag to
rtnl_notify(). A GFP_KERNEL flag would be fine in most circumstances,
but there are a few paths calling rtnl_net_notifyid() from atomic
context or from RCU critical sections. The latter also precludes the use
of gfp_any() as it wouldn't detect the RCU case. Also, the nlmsg_new()
call is wrong too, as it uses GFP_KERNEL unconditionally.
Therefore, we need to pass the GFP flags as a parameter and propagate
them through the function calls until the proper flags can be determined.
In most cases, GFP_KERNEL is fine. The exceptions are:
* openvswitch: ovs_vport_cmd_get() and ovs_vport_cmd_dump()
indirectly call rtnl_net_notifyid() from RCU critical section,
* rtnetlink: rtmsg_ifinfo_build_skb() already receives GFP flags as
parameter.
Also, in ovs_vport_cmd_build_info(), let's change the GFP flags used
by nlmsg_new(). The function is allowed to sleep, so better make the
flags consistent with the ones used in the following
ovs_vport_cmd_fill_info() call.
Found by code inspection.
Fixes: 9a9634545c70 ("netns: notify netns id events")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-23 19:39:04 +03:00
int peernet2id_alloc(struct net *net, struct net *peer, gfp_t gfp);
int peernet2id(const struct net *net, struct net *peer);
bool peernet_has_id(const struct net *net, struct net *peer);
struct net *get_net_ns_by_id(const struct net *net, int id);
struct pernet_operations {
	struct list_head list;
	/*
	 * Below methods are called without any exclusive locks.
	 * More than one net may be constructed and destructed
	 * in parallel on several cpus. Every pernet_operations
	 * implementation has to keep in mind all other pernet_operations
	 * and introduce locking, if they share common resources.
	 *
	 * The only time they are called with exclusive lock is
	 * from register_pernet_subsys(), unregister_pernet_subsys(),
	 * register_pernet_device() and unregister_pernet_device().
	 *
	 * Exit methods using blocking RCU primitives, such as
	 * synchronize_rcu(), should be implemented via exit_batch.
	 * Then, destruction of a group of nets requires a single
	 * synchronize_rcu() related to these pernet_operations,
	 * instead of a separate synchronize_rcu() for every net.
	 * Please avoid synchronize_rcu() altogether where possible.
	 *
	 * Note that a combination of pre_exit() and exit() can
	 * be used, since a synchronize_rcu() is guaranteed between
	 * the calls.
	 */
	int (*init)(struct net *net);
	void (*pre_exit)(struct net *net);
	void (*exit)(struct net *net);
	void (*exit_batch)(struct list_head *net_exit_list);
netns: make struct pernet_operations::id unsigned int
Make struct pernet_operations::id unsigned.
There are 2 reasons to do so:
1)
This field is really an index into a zero-based array and
thus is an unsigned entity. Using a negative value is out-of-bounds
access by definition.
2)
On x86_64, unsigned 32-bit data which are mixed with pointers
via array indexing or offsets added to or subtracted from pointers
are preferred to signed 32-bit data.
"int" being used as an array index needs to be sign-extended
to 64-bit before being used.
void f(long *p, int i)
{
g(p[i]);
}
roughly translates to
movsx rsi, esi
mov rdi, [rsi+...]
call g
MOVSX is 3 byte instruction which isn't necessary if the variable is
unsigned because x86_64 is zero extending by default.
Now, there is net_generic() function which, you guessed it right, uses
"int" as an array index:
static inline void *net_generic(const struct net *net, int id)
{
...
ptr = ng->ptr[id - 1];
...
}
And this function is used a lot, so those sign extensions add up.
Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
messing with code generation):
add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
Unfortunately some functions actually grow bigger.
This is a seemingly random artifact of code generation, with the
register allocator being used differently. gcc decides that some
variable needs to live in new r8+ registers and every access now
requires a REX prefix. Or it is shifted into r12, so the [r12+0]
addressing mode has to be used, which is longer than [r8].
However, the overall balance is in the negative direction:
add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
function old new delta
nfsd4_lock 3886 3959 +73
tipc_link_build_proto_msg 1096 1140 +44
mac80211_hwsim_new_radio 2776 2808 +32
tipc_mon_rcv 1032 1058 +26
svcauth_gss_legacy_init 1413 1429 +16
tipc_bcbase_select_primary 379 392 +13
nfsd4_exchange_id 1247 1260 +13
nfsd4_setclientid_confirm 782 793 +11
...
put_client_renew_locked 494 480 -14
ip_set_sockfn_get 730 716 -14
geneve_sock_add 829 813 -16
nfsd4_sequence_done 721 703 -18
nlmclnt_lookup_host 708 686 -22
nfsd4_lockt 1085 1063 -22
nfs_get_client 1077 1050 -27
tcf_bpf_init 1106 1076 -30
nfsd4_encode_fattr 5997 5930 -67
Total: Before=154856051, After=154854321, chg -0.00%
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-17 04:58:21 +03:00
	unsigned int *id;
	size_t size;
};
/*
 * Use these carefully. If you implement a network device and it
 * needs per network namespace operations use device pernet operations,
 * otherwise use pernet subsys operations.
 *
 * Network interfaces need to be removed from a dying netns _before_
 * subsys notifiers can be called, as most of the network code cleanup
 * (which is done from subsys notifiers) runs with the assumption that
 * dev_remove_pack has been called so no new packets will arrive during
 * and after the cleanup functions have been called. dev_remove_pack
 * is not per namespace so instead the guarantee of no more packets
 * arriving in a network namespace is provided by ensuring that all
 * network devices and all sockets have left the network namespace
 * before the cleanup methods are called.
 *
 * For the longest time the ipv4 icmp code was registered as a pernet
 * device which caused kernel oops, and panics during network
 * namespace cleanup. So please don't get this wrong.
 */
int register_pernet_subsys(struct pernet_operations *);
void unregister_pernet_subsys(struct pernet_operations *);
int register_pernet_device(struct pernet_operations *);
void unregister_pernet_device(struct pernet_operations *);
struct ctl_table;

#ifdef CONFIG_SYSCTL
int net_sysctl_init(void);
struct ctl_table_header *register_net_sysctl(struct net *net, const char *path,
					     struct ctl_table *table);
void unregister_net_sysctl_table(struct ctl_table_header *header);
#else
static inline int net_sysctl_init(void) { return 0; }
static inline struct ctl_table_header *register_net_sysctl(struct net *net,
	const char *path, struct ctl_table *table)
{
	return NULL;
}
static inline void unregister_net_sysctl_table(struct ctl_table_header *header)
{
}
#endif
static inline int rt_genid_ipv4(const struct net *net)
{
	return atomic_read(&net->ipv4.rt_genid);
}
ipv6: Use global sernum for dst validation with nexthop objects
Nik reported a bug with pcpu dst cache when nexthop objects are
used illustrated by the following:
$ ip netns add foo
$ ip -netns foo li set lo up
$ ip -netns foo addr add 2001:db8:11::1/128 dev lo
$ ip netns exec foo sysctl net.ipv6.conf.all.forwarding=1
$ ip li add veth1 type veth peer name veth2
$ ip li set veth1 up
$ ip addr add 2001:db8:10::1/64 dev veth1
$ ip li set dev veth2 netns foo
$ ip -netns foo li set veth2 up
$ ip -netns foo addr add 2001:db8:10::2/64 dev veth2
$ ip -6 nexthop add id 100 via 2001:db8:10::2 dev veth1
$ ip -6 route add 2001:db8:11::1/128 nhid 100
Create a pcpu entry on cpu 0:
$ taskset -a -c 0 ip -6 route get 2001:db8:11::1
Re-add the route entry:
$ ip -6 ro del 2001:db8:11::1
$ ip -6 route add 2001:db8:11::1/128 nhid 100
Route get on cpu 0 returns the stale pcpu:
$ taskset -a -c 0 ip -6 route get 2001:db8:11::1
RTNETLINK answers: Network is unreachable
While cpu 1 works:
$ taskset -a -c 1 ip -6 route get 2001:db8:11::1
2001:db8:11::1 from :: via 2001:db8:10::2 dev veth1 src 2001:db8:10::1 metric 1024 pref medium
Conversion of FIB entries to work with external nexthop objects
missed an important difference between IPv4 and IPv6 - how dst
entries are invalidated when the FIB changes. IPv4 has a per-network
namespace generation id (rt_genid) that is bumped on changes to the FIB.
Checking if a dst_entry is still valid means comparing rt_genid in the
rtable to the current value of rt_genid for the namespace.
IPv6 also has a per network namespace counter, fib6_sernum, but the
count is saved per fib6_node. With the per-node counter only dst_entries
based on fib entries under the node are invalidated when changes are
made to the routes - limiting the scope of invalidations. IPv6 uses a
reference in the rt6_info, 'from', to track the corresponding fib entry
used to create the dst_entry. When validating a dst_entry, the 'from'
is used to backtrack to the fib6_node and check the sernum of it to the
cookie passed to the dst_check operation.
With the inline format (nexthop definition inline with the fib6_info),
dst_entries cached in the fib6_nh have a 1:1 correlation between fib
entries, nexthop data and dst_entries. With external nexthops, IPv6
looks more like IPv4 which means multiple fib entries across disparate
fib6_nodes can all reference the same fib6_nh. That means validation
of dst_entries based on external nexthops needs to use the IPv4 format
- the per-network namespace counter.
Add sernum to rt6_info and set it when creating a pcpu dst entry. Update
rt6_get_cookie to return sernum if it is set, and update dst_check for
IPv6 to look for a set sernum and base the check on it if so. Finally,
rt6_get_pcpu_route needs to validate the cached entry before returning
a pcpu entry (similar to the rt_cache_valid calls in __mkroute_input and
__mkroute_output for IPv4).
This problem only affects routes using the new, external nexthops.
Thanks to the kbuild test robot for catching the IS_ENABLED needed
around rt_genid_ipv6 before I sent this out.
Fixes: 5b98324ebe29 ("ipv6: Allow routes to use nexthop objects")
Reported-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Tested-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01 17:53:08 +03:00
#if IS_ENABLED(CONFIG_IPV6)
static inline int rt_genid_ipv6(const struct net *net)
{
	return atomic_read(&net->ipv6.fib6_sernum);
}
#endif
static inline void rt_genid_bump_ipv4(struct net *net)
{
	atomic_inc(&net->ipv4.rt_genid);
}

extern void (*__fib6_flush_trees)(struct net *net);
static inline void rt_genid_bump_ipv6(struct net *net)
{
	if (__fib6_flush_trees)
		__fib6_flush_trees(net);
}
#if IS_ENABLED(CONFIG_IEEE802154_6LOWPAN)
static inline struct netns_ieee802154_lowpan *
net_ieee802154_lowpan(struct net *net)
{
	return &net->ieee802154_lowpan;
}
#endif
/* For callers who don't really care about whether it's IPv4 or IPv6 */
static inline void rt_genid_bump_all(struct net *net)
{
	rt_genid_bump_ipv4(net);
	rt_genid_bump_ipv6(net);
}
|
2007-11-30 15:55:42 +03:00
|
|
|
|
2020-01-16 23:16:46 +03:00
|
|
|
static inline int fnhe_genid(const struct net *net)
|
2013-05-28 00:46:33 +04:00
|
|
|
{
|
|
|
|
return atomic_read(&net->fnhe_genid);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void fnhe_genid_bump(struct net *net)
|
|
|
|
{
|
|
|
|
atomic_inc(&net->fnhe_genid);
|
|
|
|
}
|
|
|
|
|
2007-09-12 13:50:50 +04:00
|
|
|
#endif /* __NET_NET_NAMESPACE_H */
|