Граф коммитов

2370 Коммитов

Автор SHA1 Сообщение Дата
David S. Miller f8785c5514 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nftables
Pablo Neira Ayuso says:

====================
This batch contains fives nf_tables patches for your net-next tree,
they are:

* Fix possible use after free in the module removal path of the
  x_tables compatibility layer, from Dan Carpenter.

* Add filter chain type for the bridge family, from myself.

* Fix Kconfig dependencies of the nf_tables bridge family with
  the core, from myself.

* Fix sparse warnings in nft_nat, from Tomasz Bursztyka.

* Remove duplicated include in the IPv4 family support for nf_tables,
  from Wei Yongjun.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-04 19:48:57 -05:00
David S. Miller 72c39a0ade Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
This is another batch containing Netfilter/IPVS updates for your net-next
tree, they are:

* Six patches to make the ipt_CLUSTERIP target support netnamespace,
  from Gao feng.

* Two cleanups for the nf_conntrack_acct infrastructure, introducing
  a new structure to encapsulate conntrack counters, from Holger
  Eitzenberger.

* Fix missing verdict in SCTP support for IPVS, from Daniel Borkmann.

* Skip checksum recalculation in SCTP support for IPVS, also from
  Daniel Borkmann.

* Fix behavioural change in xt_socket after IP early demux, from
  Florian Westphal.

* Fix bogus large memory allocation in the bitmap port set type in ipset,
  from Jozsef Kadlecsik.

* Fix possible compilation issues in the hash netnet set type in ipset,
  also from Jozsef Kadlecsik.

* Define constants to identify netlink callback data in ipset dumps,
  again from Jozsef Kadlecsik.

* Use sock_gen_put() in xt_socket to replace xt_socket_put_sk,
  from Eric Dumazet.

* Improvements for the SH scheduler in IPVS, from Alexander Frolkin.

* Remove extra delay due to unneeded rcu barrier in IPVS net namespace
  cleanup path, from Julian Anastasov.

* Save some cycles in ip6t_REJECT by skipping checksum validation in
  packets leaving from our stack, from Stanislav Fomichev.

* Fix IPVS_CMD_ATTR_MAX definition in IPVS, larger that required, from
  Julian Anastasov.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-04 19:46:58 -05:00
Dan Carpenter c359c4157c netfilter: nft_compat: use _safe version of list_for_each
We need to use the _safe version of list_for_each_entry() here otherwise
we have a use after free bug.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-11-04 22:58:30 +01:00
David S. Miller 394efd19d5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/emulex/benet/be.h
	drivers/net/netconsole.c
	net/bridge/br_private.h

Three mostly trivial conflicts.

The net/bridge/br_private.h conflict was a function signature (argument
addition) change overlapping with the extern removals from Joe Perches.

In drivers/net/netconsole.c we had one change adjusting a printk message
whilst another changed "printk(KERN_INFO" into "pr_info(".

Lastly, the emulex change was a new inline function addition overlapping
with Joe Perches's extern removals.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-04 13:48:30 -05:00
Holger Eitzenberger 4542fa4727 netfilter: ctnetlink: account both directions in one step
With the intent to dump other accounting data later.
This patch is a cleanup.

Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-11-03 21:49:32 +01:00
Holger Eitzenberger f7b13e4330 netfilter: introduce nf_conn_acct structure
Encapsulate counters for both directions into nf_conn_acct. During
that process also consistently name pointers to the extend 'acct',
not 'counters'. This patch is a cleanup.

Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-11-03 21:48:49 +01:00
Daniel Borkmann 97203abe6b net: ipvs: sctp: do not recalc sctp csum when ports didn't change
Unlike UDP or TCP, we do not take the pseudo-header into
account in SCTP checksums. So in case port mapping is the
very same, we do not need to recalculate the whole SCTP
checksum in software, which is very expensive.

Also, similarly as in TCP, take into account when a private
helper mangled the packet. In that case, we also need to
recalculate the checksum even if ports might be same.

Thanks for feedback regarding skb->ip_summed checks from
Julian Anastasov; here's a discussion on these checks for
snat and dnat:

* For snat_handler(), we can see CHECKSUM_PARTIAL from
  virtual devices, and from LOCAL_OUT, otherwise it
  should be CHECKSUM_UNNECESSARY. In general, in snat it
  is more complex. skb contains the original route and
  ip_vs_route_me_harder() can change the route after
  snat_handler. So, for locally generated replies from
  local server we can not preserve the CHECKSUM_PARTIAL
  mode. It is an chicken or egg dilemma: snat_handler
  needs the device after rerouting (to check for
  NETIF_F_SCTP_CSUM), while ip_route_me_harder() wants
  the snat_handler() to put the new saddr for proper
  rerouting.

* For dnat_handler(), we should not see CHECKSUM_COMPLETE
  for SCTP, in fact the small set of drivers that support
  SCTP offloading return CHECKSUM_UNNECESSARY on correctly
  received SCTP csum. We can see CHECKSUM_PARTIAL from
  local stack or received from virtual drivers. The idea is
  that SCTP decides to avoid csum calculation if hardware
  supports offloading. IPVS can change the device after
  rerouting to real server but we can preserve the
  CHECKSUM_PARTIAL mode if the new device supports
  offloading too. This works because skb dst is changed
  before dnat_handler and we see the new device. So, checks
  in the 'if' part will decide whether it is ok to keep
  CHECKSUM_PARTIAL for the output. If the packet was with
  CHECKSUM_NONE, hence we deal with unknown checksum. As we
  recalculate the sum for IP header in all cases, it should
  be safe to use CHECKSUM_UNNECESSARY. We can forward wrong
  checksum in this case (without cp->app). In case of
  CHECKSUM_UNNECESSARY, the csum was valid on receive.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2013-10-30 09:48:16 +09:00
Holger Eitzenberger d954777324 netfilter: xt_NFQUEUE: fix --queue-bypass regression
V3 of the NFQUEUE target ignores the --queue-bypass flag,
causing packets to be dropped when the userspace listener
isn't running.

Regression is in since 8746ddcf12 ("netfilter: xt_NFQUEUE:
introduce CPU fanout").

Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-29 13:05:54 +01:00
Tomasz Bursztyka 98c37b6b01 netfilter: nft_nat: Fix endianness issue reported by sparse
This patch fixes this:

CHECK   net/netfilter/nft_nat.c
net/netfilter/nft_nat.c:50:43: warning: incorrect type in assignment (different base types)
net/netfilter/nft_nat.c:50:43:    expected restricted __be32 [addressable] [usertype] ip
net/netfilter/nft_nat.c:50:43:    got unsigned int [unsigned] [usertype] <noident>
net/netfilter/nft_nat.c:51:43: warning: incorrect type in assignment (different base types)
net/netfilter/nft_nat.c:51:43:    expected restricted __be32 [addressable] [usertype] ip
net/netfilter/nft_nat.c:51:43:    got unsigned int [unsigned] [usertype] <noident>
net/netfilter/nft_nat.c:65:37: warning: incorrect type in assignment (different base types)
net/netfilter/nft_nat.c:65:37:    expected restricted __be16 [addressable] [assigned] [usertype] all
net/netfilter/nft_nat.c:65:37:    got unsigned int [unsigned] <noident>
net/netfilter/nft_nat.c:66:37: warning: incorrect type in assignment (different base types)
net/netfilter/nft_nat.c:66:37:    expected restricted __be16 [addressable] [assigned] [usertype] all
net/netfilter/nft_nat.c:66:37:    got unsigned int [unsigned] <noident>

Signed-off-by: Tomasz Bursztyka <tomasz.bursztyka@linux.intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-28 17:41:49 +01:00
Daniel Borkmann 6e7cd27c0f net: ipvs: sctp: add missing verdict assignments in sctp_conn_schedule
If skb_header_pointer() fails, we need to assign a verdict, that is
NF_DROP in this case, otherwise, we would leave the verdict from
conn_schedule() uninitialized when returning.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2013-10-28 19:00:49 +09:00
Michael Opdenacker 1e56555ee1 netfilter: ipset: remove duplicate define
This patch removes a duplicate define from
net/netfilter/ipset/ip_set_hash_gen.h

Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2013-10-27 19:24:45 +01:00
Jozsef Kadlecsik dc476e7c8e netfilter:ipset: Fix memory allocation for bitmap:port
At the restructuring of the bitmap types creation in ipset, for the
bitmap:port type wrong (too large) memory allocation was copied
(netfilter bugzilla id #859).

Reported-by: Quentin Armitage <quentin@armitage.org.uk>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2013-10-27 19:24:39 +01:00
David S. Miller afb14c7cb6 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
The following patchset contains three netfilter fixes for your net
tree, they are:

* A couple of fixes to resolve info leak to userspace due to uninitialized
  memory area in ulogd, from Mathias Krause.

* Fix instruction ordering issues that may lead to the access of
  uninitialized data in x_tables. The problem involves the table update
 (producer) and the main packet matching (consumer) routines. Detected in
  SMP ARMv7, from Will Deacon.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-23 16:55:04 -04:00
David S. Miller c3fa32b976 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/usb/qmi_wwan.c
	include/net/dst.h

Trivial merge conflicts, both were overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-23 16:49:34 -04:00
Jozsef Kadlecsik 1a869205c7 netfilter: ipset: The unnamed union initialization may lead to compilation error
The unnamed union should be possible to be initialized directly, but
unfortunately it's not so:

/usr/src/ipset/kernel/net/netfilter/ipset/ip_set_hash_netnet.c: In
function ?hash_netnet4_kadt?:
/usr/src/ipset/kernel/net/netfilter/ipset/ip_set_hash_netnet.c:141:
error: unknown field ?cidr? specified in initializer

Reported-by: Husnu Demir <hdemir@metu.edu.tr>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-22 10:14:01 +02:00
Jozsef Kadlecsik 93302880d8 netfilter: ipset: Use netlink callback dump args only
Instead of cb->data, use callback dump args only and introduce symbolic
names instead of plain numbers at accessing the argument members.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-22 10:13:59 +02:00
Will Deacon b416c144f4 netfilter: x_tables: fix ordering of jumpstack allocation and table update
During kernel stability testing on an SMP ARMv7 system, Yalin Wang
reported the following panic from the netfilter code:

  1fe0: 0000001c 5e2d3b10 4007e779 4009e110 60000010 00000032 ff565656 ff545454
  [<c06c48dc>] (ipt_do_table+0x448/0x584) from [<c0655ef0>] (nf_iterate+0x48/0x7c)
  [<c0655ef0>] (nf_iterate+0x48/0x7c) from [<c0655f7c>] (nf_hook_slow+0x58/0x104)
  [<c0655f7c>] (nf_hook_slow+0x58/0x104) from [<c0683bbc>] (ip_local_deliver+0x88/0xa8)
  [<c0683bbc>] (ip_local_deliver+0x88/0xa8) from [<c0683718>] (ip_rcv_finish+0x418/0x43c)
  [<c0683718>] (ip_rcv_finish+0x418/0x43c) from [<c062b1c4>] (__netif_receive_skb+0x4cc/0x598)
  [<c062b1c4>] (__netif_receive_skb+0x4cc/0x598) from [<c062b314>] (process_backlog+0x84/0x158)
  [<c062b314>] (process_backlog+0x84/0x158) from [<c062de84>] (net_rx_action+0x70/0x1dc)
  [<c062de84>] (net_rx_action+0x70/0x1dc) from [<c0088230>] (__do_softirq+0x11c/0x27c)
  [<c0088230>] (__do_softirq+0x11c/0x27c) from [<c008857c>] (do_softirq+0x44/0x50)
  [<c008857c>] (do_softirq+0x44/0x50) from [<c0088614>] (local_bh_enable_ip+0x8c/0xd0)
  [<c0088614>] (local_bh_enable_ip+0x8c/0xd0) from [<c06b0330>] (inet_stream_connect+0x164/0x298)
  [<c06b0330>] (inet_stream_connect+0x164/0x298) from [<c061d68c>] (sys_connect+0x88/0xc8)
  [<c061d68c>] (sys_connect+0x88/0xc8) from [<c000e340>] (ret_fast_syscall+0x0/0x30)
  Code: 2a000021 e59d2028 e59de01c e59f011c (e7824103)
  ---[ end trace da227214a82491bd ]---
  Kernel panic - not syncing: Fatal exception in interrupt

This comes about because CPU1 is executing xt_replace_table in response
to a setsockopt syscall, resulting in:

	ret = xt_jumpstack_alloc(newinfo);
		--> newinfo->jumpstack = kzalloc(size, GFP_KERNEL);

	[...]

	table->private = newinfo;
	newinfo->initial_entries = private->initial_entries;

Meanwhile, CPU0 is handling the network receive path and ends up in
ipt_do_table, resulting in:

	private = table->private;

	[...]

	jumpstack  = (struct ipt_entry **)private->jumpstack[cpu];

On weakly ordered memory architectures, the writes to table->private
and newinfo->jumpstack from CPU1 can be observed out of order by CPU0.
Furthermore, on architectures which don't respect ordering of address
dependencies (i.e. Alpha), the reads from CPU0 can also be re-ordered.

This patch adds an smp_wmb() before the assignment to table->private
(which is essentially publishing newinfo) to ensure that all writes to
newinfo will be observed before plugging it into the table structure.
A dependent-read barrier is also added on the consumer sides, to ensure
the same ordering requirements are also respected there.

Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reported-by: Wang, Yalin <Yalin.Wang@sonymobile.com>
Tested-by: Wang, Yalin <Yalin.Wang@sonymobile.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-22 10:11:29 +02:00
Julian Anastasov 56e42441ed netfilter: nf_conntrack: fix rt6i_gateway checks for H.323 helper
Now when rt6_nexthop() can return nexthop address we can use it
for proper nexthop comparison of directly connected destinations.
For more information refer to commit bbb5823cf7
("netfilter: nf_conntrack: fix rt_gateway checks for H.323 helper").

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-21 18:37:01 -04:00
Joe Perches c1b1203d65 net: misc: Remove extern from function prototypes
There are a mix of function prototypes with and without extern
in the kernel sources.  Standardize on not using extern for
function prototypes.

Function prototypes don't need to be written with extern.
extern is assumed by the compiler.  Its use is as unnecessary as
using auto to declare automatic/local variables in a block.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-19 19:12:11 -04:00
Eric Dumazet 1a8bf6eeef netfilter: xt_socket: use sock_gen_put()
TCP listener refactoring, part 7 :

Use sock_gen_put() instead of xt_socket_put_sk() for future
SYN_RECV support.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-17 10:27:25 +02:00
Alexander Frolkin 1255ce5f10 ipvs: improved SH fallback strategy
Improve the SH fallback realserver selection strategy.

With sh and sh-fallback, if a realserver is down, this attempts to
distribute the traffic that would have gone to that server evenly
among the remaining servers.

Signed-off-by: Alexander Frolkin <avf@eldamar.org.uk>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2013-10-15 10:54:50 +09:00
Julian Anastasov 9e4e948a3e ipvs: avoid rcu_barrier during netns cleanup
commit 578bc3ef1e ("ipvs: reorganize dest trash") added
rcu_barrier() on cleanup to wait dest users and schedulers
like LBLC and LBLCR to put their last dest reference.
Using rcu_barrier with many namespaces is problematic.

Trying to fix it by freeing dest with kfree_rcu is not
a solution, RCU callbacks can run in parallel and execution
order is random.

Fix it by creating new function ip_vs_dest_put_and_free()
which is heavier than ip_vs_dest_put(). We will use it just
for schedulers like LBLC, LBLCR that can delay their dest
release.

By default, dests reference is above 0 if they are present in
service and it is 0 when deleted but still in trash list.
Change the dest trash code to use ip_vs_dest_put_and_free(),
so that refcnt -1 can be used for freeing. As result,
such checks remain in slow path and the rcu_barrier() from
netns cleanup can be removed.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2013-10-15 10:36:01 +09:00
Pablo Neira Ayuso b5bc89bfa0 netfilter: nf_tables: add trace support
This patch adds support for tracing the packet travel through
the ruleset, in a similar fashion to x_tables.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-14 18:01:02 +02:00
Pablo Neira Ayuso 0628b123c9 netfilter: nfnetlink: add batch support and use it from nf_tables
This patch adds a batch support to nfnetlink. Basically, it adds
two new control messages:

* NFNL_MSG_BATCH_BEGIN, that indicates the beginning of a batch,
  the nfgenmsg->res_id indicates the nfnetlink subsystem ID.

* NFNL_MSG_BATCH_END, that results in the invocation of the
  ss->commit callback function. If not specified or an error
  ocurred in the batch, the ss->abort function is invoked
  instead.

The end message represents the commit operation in nftables, the
lack of end message results in an abort. This patch also adds the
.call_batch function that is only called from the batch receival
path.

This patch adds atomic rule updates and dumps based on
bitmask generations. This allows to atomically commit a set of
rule-set updates incrementally without altering the internal
state of existing nf_tables expressions/matches/targets.

The idea consists of using a generation cursor of 1 bit and
a bitmask of 2 bits per rule. Assuming the gencursor is 0,
then the genmask (expressed as a bitmask) can be interpreted
as:

00 active in the present, will be active in the next generation.
01 inactive in the present, will be active in the next generation.
10 active in the present, will be deleted in the next generation.
 ^
 gencursor

Once you invoke the transition to the next generation, the global
gencursor is updated:

00 active in the present, will be active in the next generation.
01 active in the present, needs to zero its future, it becomes 00.
10 inactive in the present, delete now.
^
gencursor

If a dump is in progress and nf_tables enters a new generation,
the dump will stop and return -EBUSY to let userspace know that
it has to retry again. In order to invalidate dumps, a global
genctr counter is increased everytime nf_tables enters a new
generation.

This new operation can be used from the user-space utility
that controls the firewall, eg.

nft -f restore

The rule updates contained in `file' will be applied atomically.

cat file
-----
add filter INPUT ip saddr 1.1.1.1 counter accept #1
del filter INPUT ip daddr 2.2.2.2 counter drop   #2
-EOF-

Note that the rule 1 will be inactive until the transition to the
next generation, the rule 2 will be evicted in the next generation.

There is a penalty during the rule update due to the branch
misprediction in the packet matching framework. But that should be
quickly resolved once the iteration over the commit list that
contain rules that require updates is finished.

Event notification happens once the rule-set update has been
committed. So we skip notifications is case the rule-set update
is aborted, which can happen in case that the rule-set is tested
to apply correctly.

This patch squashed the following patches from Pablo:

* nf_tables: atomic rule updates and dumps
* nf_tables: get rid of per rule list_head for commits
* nf_tables: use per netns commit list
* nfnetlink: add batch support and use it from nf_tables
* nf_tables: all rule updates are transactional
* nf_tables: attach replacement rule after stale one
* nf_tables: do not allow deletion/replacement of stale rules
* nf_tables: remove unused NFTA_RULE_FLAGS

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-14 18:01:01 +02:00
Eric Leblond 5e94846686 netfilter: nf_tables: add insert operation
This patch adds a new rule attribute NFTA_RULE_POSITION which is
used to store the position of a rule relatively to the others.
By providing the create command and specifying the position, the
rule is inserted after the rule with the handle equal to the
provided position.

Regarding notification, the position attribute specifies the
handle of the previous rule to make sure we don't point to any
stale rule in notifications coming from the commit path.

This patch includes the following fix from Pablo:

* nf_tables: fix rule deletion event reporting

Signed-off-by: Eric Leblond <eric@regit.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-14 18:01:00 +02:00
Pablo Neira Ayuso 99633ab29b netfilter: nf_tables: complete net namespace support
Register family per netnamespace to ensure that sets are
only visible in its approapriate namespace.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-14 18:00:59 +02:00
Tomasz Bursztyka eb31628e37 netfilter: nf_tables: Add support for IPv6 NAT
This patch generalizes the NAT expression to support both IPv4 and IPv6
using the existing IPv4/IPv6 NAT infrastructure. This also adds the
NAT chain type for IPv6.

This patch collapses the following patches that were posted to the
netfilter-devel mailing list, from Tomasz:

* nf_tables: Change NFTA_NAT_ attributes to better semantic significance
* nf_tables: Split IPv4 NAT into NAT expression and IPv4 NAT chain
* nf_tables: Add support for IPv6 NAT expression
* nf_tables: Add support for IPv6 NAT chain
* nf_tables: Fix up build issue on IPv6 NAT support

And, from Pablo Neira Ayuso:

* fix missing dependencies in nft_chain_nat

Signed-off-by: Tomasz Bursztyka <tomasz.bursztyka@linux.intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-14 18:00:58 +02:00
Pablo Neira Ayuso 9ddf632357 netfilter: nf_tables: add support for dormant tables
This patch allows you to temporarily disable an entire table.
You can change the state of a dormant table via NFT_MSG_NEWTABLE
messages. Using this operation you can wake up a table, so their
chains are registered.

This provides atomicity at chain level. Thus, the rule-set of one
chain is applied at once, avoiding any possible intermediate state
in every chain. Still, the chains that belongs to a table are
registered consecutively. This also allows you to have inactive
tables in the kernel.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-14 18:00:57 +02:00
Pablo Neira Ayuso c54032e05b netfilter: nf_tables: nft_payload: fix transport header base
We cannot use skb->transport_header since it's unset, use
pkt->xt.thoff instead.

Now possible using information made available through the x_tables
compatibility layer.

Reported-by: Eric Leblond <eric@regit.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-14 18:00:56 +02:00
Pablo Neira Ayuso 0ca743a559 netfilter: nf_tables: add compatibility layer for x_tables
This patch adds the x_tables compatibility layer. This allows you
to use existing x_tables matches and targets from nf_tables.

This compatibility later allows us to use existing matches/targets
for features that are still missing in nf_tables. We can progressively
replace them with native nf_tables extensions. It also provides the
userspace compatibility software that allows you to express the
rule-set using the iptables syntax but using the nf_tables kernel
components.

In order to get this compatibility layer working, I've done the
following things:

* add NFNL_SUBSYS_NFT_COMPAT: this new nfnetlink subsystem is used
to query the x_tables match/target revision, so we don't need to
use the native x_table getsockopt interface.

* emulate xt structures: this required extending the struct nft_pktinfo
to include the fragment offset, which is already obtained from
ip[6]_tables and that is used by some matches/targets.

* add support for default policy to base chains, required to emulate
  x_tables.

* add NFTA_CHAIN_USE attribute to obtain the number of references to
  chains, required by x_tables emulation.

* add chain packet/byte counters using per-cpu.

* support 32-64 bits compat.

For historical reasons, this patch includes the following patches
that were posted in the netfilter-devel mailing list.

From Pablo Neira Ayuso:
* nf_tables: add default policy to base chains
* netfilter: nf_tables: add NFTA_CHAIN_USE attribute
* nf_tables: nft_compat: private data of target and matches in contiguous area
* nf_tables: validate hooks for compat match/target
* nf_tables: nft_compat: release cached matches/targets
* nf_tables: x_tables support as a compile time option
* nf_tables: fix alias for xtables over nftables module
* nf_tables: add packet and byte counters per chain
* nf_tables: fix per-chain counter stats if no counters are passed
* nf_tables: don't bump chain stats
* nf_tables: add protocol and flags for xtables over nf_tables
* nf_tables: add ip[6]t_entry emulation
* nf_tables: move specific layer 3 compat code to nf_tables_ipv[4|6]
* nf_tables: support 32bits-64bits x_tables compat
* nf_tables: fix compilation if CONFIG_COMPAT is disabled

From Patrick McHardy:
* nf_tables: move policy to struct nft_base_chain
* nf_tables: send notifications for base chain policy changes

From Alexander Primak:
* nf_tables: remove the duplicate NF_INET_LOCAL_OUT

From Nicolas Dichtel:
* nf_tables: fix compilation when nf-netlink is a module

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-14 18:00:04 +02:00
Pablo Neira Ayuso 9370761c56 netfilter: nf_tables: convert built-in tables/chains to chain types
This patch converts built-in tables/chains to chain types that
allows you to deploy customized table and chain configurations from
userspace.

After this patch, you have to specify the chain type when
creating a new chain:

 add chain ip filter output { type filter hook input priority 0; }
                              ^^^^ ------

The existing chain types after this patch are: filter, route and
nat. Note that tables are just containers of chains with no specific
semantics, which is a significant change with regards to iptables.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-14 17:16:11 +02:00
Patrick McHardy c29b72e025 netfilter: nft_payload: add optimized payload implementation for small loads
Add an optimized payload expression implementation for small (up to 4 bytes)
aligned data loads from the linear packet area.

This patch also includes original Patrick McHardy's entitled (nf_tables:
inline nft_payload_fast_eval() into main evaluation loop).

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-14 17:16:10 +02:00
Patrick McHardy cb7dbfd039 netfilter: nf_tables: add optimized data comparison for small values
Add an optimized version of nft_data_cmp() that only handles values of to
4 bytes length.

This patch includes original Patrick McHardy's patch entitled (nf_tables:
inline nft_cmp_fast_eval() into main evaluation loop).

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-14 17:16:09 +02:00
Patrick McHardy ef1f7df917 netfilter: nf_tables: expression ops overloading
Split the expression ops into two parts and support overloading of
the runtime expression ops based on the requested function through
a ->select_ops() callback.

This can be used to provide optimized implementations, for instance
for loading small aligned amounts of data from the packet or inlining
frequently used operations into the main evaluation loop.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-14 17:16:08 +02:00
Patrick McHardy 20a69341f2 netfilter: nf_tables: add netlink set API
This patch adds the new netlink API for maintaining nf_tables sets
independently of the ruleset. The API supports the following operations:

- creation of sets
- deletion of sets
- querying of specific sets
- dumping of all sets

- addition of set elements
- removal of set elements
- dumping of all set elements

Sets are identified by name, each table defines an individual namespace.
The name of a set may be allocated automatically, this is mostly useful
in combination with the NFT_SET_ANONYMOUS flag, which destroys a set
automatically once the last reference has been released.

Sets can be marked constant, meaning they're not allowed to change while
linked to a rule. This allows to perform lockless operation for set
types that would otherwise require locking.

Additionally, if the implementation supports it, sets can (as before) be
used as maps, associating a data value with each key (or range), by
specifying the NFT_SET_MAP flag and can be used for interval queries by
specifying the NFT_SET_INTERVAL flag.

Set elements are added and removed incrementally. All element operations
support batching, reducing netlink message and set lookup overhead.

The old "set" and "hash" expressions are replaced by a generic "lookup"
expression, which binds to the specified set. Userspace is not aware
of the actual set implementation used by the kernel anymore, all
configuration options are generic.

Currently the implementation selection logic is largely missing and the
kernel will simply use the first registered implementation supporting the
requested operation. Eventually, the plan is to have userspace supply a
description of the data characteristics and select the implementation
based on expected performance and memory use.

This patch includes the new 'lookup' expression to look up for element
matching in the set.

This patch includes kernel-doc descriptions for this set API and it
also includes the following fixes.

From Patrick McHardy:
* netfilter: nf_tables: fix set element data type in dumps
* netfilter: nf_tables: fix indentation of struct nft_set_elem comments
* netfilter: nf_tables: fix oops in nft_validate_data_load()
* netfilter: nf_tables: fix oops while listing sets of built-in tables
* netfilter: nf_tables: destroy anonymous sets immediately if binding fails
* netfilter: nf_tables: propagate context to set iter callback
* netfilter: nf_tables: add loop detection

From Pablo Neira Ayuso:
* netfilter: nf_tables: allow to dump all existing sets
* netfilter: nf_tables: fix wrong type for flags variable in newelem

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-14 17:16:07 +02:00
Patrick McHardy 96518518cc netfilter: add nftables
This patch adds nftables which is the intended successor of iptables.
This packet filtering framework reuses the existing netfilter hooks,
the connection tracking system, the NAT subsystem, the transparent
proxying engine, the logging infrastructure and the userspace packet
queueing facilities.

In a nutshell, nftables provides a pseudo-state machine with 4 general
purpose registers of 128 bits and 1 specific purpose register to store
verdicts. This pseudo-machine comes with an extensible instruction set,
a.k.a. "expressions" in the nftables jargon. The expressions included
in this patch provide the basic functionality, they are:

* bitwise: to perform bitwise operations.
* byteorder: to change from host/network endianess.
* cmp: to compare data with the content of the registers.
* counter: to enable counters on rules.
* ct: to store conntrack keys into register.
* exthdr: to match IPv6 extension headers.
* immediate: to load data into registers.
* limit: to limit matching based on packet rate.
* log: to log packets.
* meta: to match metainformation that usually comes with the skbuff.
* nat: to perform Network Address Translation.
* payload: to fetch data from the packet payload and store it into
  registers.
* reject (IPv4 only): to explicitly close connection, eg. TCP RST.

Using this instruction-set, the userspace utility 'nft' can transform
the rules expressed in human-readable text representation (using a
new syntax, inspired by tcpdump) to nftables bytecode.

nftables also inherits the table, chain and rule objects from
iptables, but in a more configurable way, and it also includes the
original datatype-agnostic set infrastructure with mapping support.
This set infrastructure is enhanced in the follow up patch (netfilter:
nf_tables: add netlink set API).

This patch includes the following components:

* the netlink API: net/netfilter/nf_tables_api.c and
  include/uapi/netfilter/nf_tables.h
* the packet filter core: net/netfilter/nf_tables_core.c
* the expressions (described above): net/netfilter/nft_*.c
* the filter tables: arp, IPv4, IPv6 and bridge:
  net/ipv4/netfilter/nf_tables_ipv4.c
  net/ipv6/netfilter/nf_tables_ipv6.c
  net/ipv4/netfilter/nf_tables_arp.c
  net/bridge/netfilter/nf_tables_bridge.c
* the NAT table (IPv4 only):
  net/ipv4/netfilter/nf_table_nat_ipv4.c
* the route table (similar to mangle):
  net/ipv4/netfilter/nf_table_route_ipv4.c
  net/ipv6/netfilter/nf_table_route_ipv6.c
* internal definitions under:
  include/net/netfilter/nf_tables.h
  include/net/netfilter/nf_tables_core.h
* It also includes an skeleton expression:
  net/netfilter/nft_expr_template.c
  and the preliminary implementation of the meta target
  net/netfilter/nft_meta_target.c

It also includes a change in struct nf_hook_ops to add a new
pointer to store private data to the hook, that is used to store
the rule list per chain.

This patch is based on the patch from Patrick McHardy, plus merged
accumulated cleanups, fixes and small enhancements to the nftables
code that has been done since 2009, which are:

From Patrick McHardy:
* nf_tables: adjust netlink handler function signatures
* nf_tables: only retry table lookup after successful table module load
* nf_tables: fix event notification echo and avoid unnecessary messages
* nft_ct: add l3proto support
* nf_tables: pass expression context to nft_validate_data_load()
* nf_tables: remove redundant definition
* nft_ct: fix maxattr initialization
* nf_tables: fix invalid event type in nf_tables_getrule()
* nf_tables: simplify nft_data_init() usage
* nf_tables: build in more core modules
* nf_tables: fix double lookup expression unregistation
* nf_tables: move expression initialization to nf_tables_core.c
* nf_tables: build in payload module
* nf_tables: use NFPROTO constants
* nf_tables: rename pid variables to portid
* nf_tables: save 48 bits per rule
* nf_tables: introduce chain rename
* nf_tables: check for duplicate names on chain rename
* nf_tables: remove ability to specify handles for new rules
* nf_tables: return error for rule change request
* nf_tables: return error for NLM_F_REPLACE without rule handle
* nf_tables: include NLM_F_APPEND/NLM_F_REPLACE flags in rule notification
* nf_tables: fix NLM_F_MULTI usage in netlink notifications
* nf_tables: include NLM_F_APPEND in rule dumps

From Pablo Neira Ayuso:
* nf_tables: fix stack overflow in nf_tables_newrule
* nf_tables: nft_ct: fix compilation warning
* nf_tables: nft_ct: fix crash with invalid packets
* nft_log: group and qthreshold are 2^16
* nf_tables: nft_meta: fix socket uid,gid handling
* nft_counter: allow to restore counters
* nf_tables: fix module autoload
* nf_tables: allow to remove all rules placed in one chain
* nf_tables: use 64-bits rule handle instead of 16-bits
* nf_tables: fix chain after rule deletion
* nf_tables: improve deletion performance
* nf_tables: add missing code in route chain type
* nf_tables: rise maximum number of expressions from 12 to 128
* nf_tables: don't delete table if in use
* nf_tables: fix basechain release

From Tomasz Bursztyka:
* nf_tables: Add support for changing users chain's name
* nf_tables: Change chain's name to be fixed sized
* nf_tables: Add support for replacing a rule by another one
* nf_tables: Update uapi nftables netlink header documentation

From Florian Westphal:
* nft_log: group is u16, snaplen u32

From Phil Oester:
* nf_tables: operational limit match

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-14 17:15:48 +02:00
Pablo Neira Ayuso f59cb0453c netfilter: nf_nat: move alloc_null_binding to nf_nat_core.c
Similar to nat_decode_session, alloc_null_binding is needed for both
ip_tables and nf_tables, so move it to nf_nat_core.c. This change
is required by nf_tables.

This is an adapted version of the original patch from Patrick McHardy.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-14 11:29:39 +02:00
Patrick McHardy 795aa6ef6a netfilter: pass hook ops to hookfn
Pass the hook ops to the hookfn to allow for generic hook
functions. This change is required by nf_tables.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-14 11:29:31 +02:00
Eric Dumazet efe4208f47 ipv6: make lookups simpler and faster
TCP listener refactoring, part 4 :

To speed up inet lookups, we moved IPv4 addresses from inet to struct
sock_common

Now is time to do the same for IPv6, because it permits us to have fast
lookups for all kind of sockets, including upcoming SYN_RECV.

Getting IPv6 addresses in TCP lookups currently requires two extra cache
lines, plus a dereference (and memory stall).

inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6

This patch is way bigger than its IPv4 counter part, because for IPv4,
we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6,
it's not doable easily.

inet6_sk(sk)->daddr becomes sk->sk_v6_daddr
inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr

And timewait socket also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr
at the same offset.

We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic
macro.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-09 00:01:25 -04:00
David S. Miller d639feaaf3 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
The following patchset contains Netfilter updates for your net-next tree,
mostly ipset improvements and enhancements features, they are:

* Don't call ip_nest_end needlessly in the error path from me, suggested
  by Pablo Neira Ayuso, from Jozsef Kadlecsik.

* Fixed sparse warnings about shadowed variable and missing rcu annotation
  and fix of "may be used uninitialized" warnings, also from Jozsef.

* Renamed simple macro names to avoid namespace issues, reported by David
  Laight, again from Jozsef.

* Use fix sized type for timeout in the extension part, and cosmetic
  ordering of matches and targets separatedly in xt_set.c, from Jozsef.

* Support package fragments for IPv4 protos without ports from Anders K.
  Pedersen. For example this allows a hash:ip,port ipset containing the
  entry 192.168.0.1,gre:0 to match all package fragments for PPTP VPN
  tunnels to/from the host. Without this patch only the first package
  fragment (with fragment offset 0) was matched.

* Introduced a new operation to get both setname and family, from Jozsef.
  ip[6]tables set match and SET target need to know the family of the set
  in order to reject adding rules which refer to a set with a non-mathcing
  family. Currently such rules are silently accepted and then ignored
  instead of generating an error message to the user.

* Reworked extensions support in ipset types from Jozsef. The approach of
  defining structures with all variations is not manageable as the
  number of extensions grows. Therefore a blob for the extensions is
  introduced, somewhat similar to conntrack. The support of extensions
  which need a per data destroy function is added as well.

* When an element timed out in a list:set type of set, the garbage
  collector skipped the checking of the next element. So the purging
  was delayed to the next run of the gc, fixed by Jozsef.

* A small Kconfig fix: NETFILTER_NETLINK cannot be selected and
  ipset requires it.

* hash:net,net type from Oliver Smith. The type provides the ability to
  store pairs of subnets in a set.

* Comment for ipset entries from Oliver Smith. This makes possible to
  annotate entries in a set with comments, for example:

  ipset n foo hash:net,net comment
  ipset a foo 10.0.0.0/21,192.168.1.0/24 comment "office nets A and B"

* Fix of hash types resizing with comment extension from Jozsef.

* Fix of new extensions for list:set type when an element is added
  into a slot from where another element was pushed away from Jozsef.

* Introduction of a common function for the listing of the element
  extensions from Jozsef.

* Net namespace support for ipset from Vitaly Lavrov.

* hash:net,port,net type from Oliver Smith, which makes possible
  to store the triples of two subnets and a protocol, port pair in
  a set.

* Get xt_TCPMSS working with net namespace, by Gao feng.

* Use the proper net netnamespace to allocate skbs, also by Gao feng.

* A couple of cleanups for the conntrack SIP helper, by Holger
  Eitzenberger.

* Extend cttimeout to allow setting default conntrack timeouts via
  nfnetlink, so we can get rid of all our sysctl/proc interfaces in
  the future for timeout tuning, from me.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-04 13:26:38 -04:00
David S. Miller e024bdc051 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
The following patchset contains Netfilter/IPVS fixes for your net
tree, they are:

* Fix BUG_ON splat due to malformed TCP packets seen by synproxy, from
  Patrick McHardy.

* Fix possible weight overflow in lblc and lblcr schedulers due to
  32-bits arithmetics, from Simon Kirby.

* Fix possible memory access race in the lblc and lblcr schedulers,
  introduced when it was converted to use RCU, two patches from
  Julian Anastasov.

* Fix hard dependency on CPU 0 when reading per-cpu stats in the
  rate estimator, from Julian Anastasov.

* Fix race that may lead to object use after release, when invoking
  ipvsadm -C && ipvsadm -R, introduced when adding RCU, from Julian
  Anastasov.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-01 12:39:35 -04:00
Pablo Neira Ayuso 91cb498e6a netfilter: cttimeout: allow to set/get default protocol timeouts
Default timeouts are currently set via proc/sysctl interface, the
typical pattern is a file name like:

/proc/sys/net/netfilter/nf_conntrack_PROTOCOL_timeout_STATE

This results in one entry per default protocol state timeout.
This patch simplifies this by allowing to set default protocol
timeouts via cttimeout netlink interface.

This should allow us to get rid of the existing proc/sysctl code
in the midterm.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-01 13:17:39 +02:00
holger@eitzenberger.org 180cf72f56 netfilter: nf_ct_sip: consolidate NAT hook functions
There are currently seven different NAT hooks used in both
nf_conntrack_sip and nf_nat_sip, each of the hooks is exported in
nf_conntrack_sip, then set from the nf_nat_sip NAT helper.

And because each of them is exported there is quite some overhead
introduced due of this.

By introducing nf_nat_sip_hooks I am able to reduce both text/data
somewhat.  For nf_conntrack_sip e. g. I get

        text             data              bss              dec
old    15243             5256               32            20531
new    15010             5192               32            20234

Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-01 12:47:09 +02:00
Gao feng afff14f608 netfilter: nfnetlink_log: use proper net to allocate skb
Use proper net struct to allocate skb, otherwise
netlink mmap will be of no effect.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-01 12:46:56 +02:00
Gao feng 7433268783 netfilter: nfnetlink_queue: use proper net namespace to allocate skb
Use proper net struct to allocate skb, otherwise netlink mmap
will have no effect.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-01 12:20:31 +02:00
Oliver Smith 7c3ad056ef netfilter: ipset: Add hash:net,port,net module to kernel.
This adds a new set that provides similar functionality to ip,port,net
but permits arbitrary size subnets for both the first and last
parameter.

Signed-off-by: Oliver Smith <oliver@8.c.9.b.0.7.4.0.1.0.0.2.ip6.arpa>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2013-09-30 21:42:58 +02:00
Vitaly Lavrov 1785e8f473 netfiler: ipset: Add net namespace for ipset
This patch adds netns support for ipset.

Major changes were made in ip_set_core.c and ip_set.h.
Global variables are moved to per net namespace.
Added initialization code and the destruction of the network namespace ipset subsystem.
In the prototypes of public functions ip_set_* added parameter "struct net*".

The remaining corrections related to the change prototypes of public functions ip_set_*.

The patch for git://git.netfilter.org/ipset.git commit 6a4ec96c0b8caac5c35474e40e319704d92ca347

Signed-off-by: Vitaly Lavrov <lve@guap.ru>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2013-09-30 21:42:52 +02:00
Jozsef Kadlecsik 3fd986b3d9 netfilter: ipset: Use a common function at listing the extensions
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2013-09-30 21:42:36 +02:00
Jozsef Kadlecsik 8ec81f9a4d netfilter: ipset: For set:list types, replaced elements must be zeroed out
The new extensions require zero initialization for the new element
to be added into a slot from where another element was pushed away.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2013-09-30 21:33:29 +02:00
Jozsef Kadlecsik 80571a9ea4 netfilter: ipset: Fix hash resizing with comments
The destroy function must take into account that resizing doesn't
create new extensions so those cannot be destroyed at resize.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2013-09-30 21:33:29 +02:00