Граф коммитов

5460 Коммитов

Автор SHA1 Сообщение Дата
David S. Miller d85427e3c8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

1) Rewrite inner header IPv6 in ICMPv6 messages in ip6t_NPT,
   from Michael Zhou.

2) do_ip_vs_set_ctl() dereferences uninitialized value,
   from Peilin Ye.

3) Support for userdata in tables, from Jose M. Guisado.

4) Do not increment ct error and invalid stats at the same time,
   from Florian Westphal.

5) Remove ct ignore stats, also from Florian.

6) Add ct stats for clash resolution, from Florian Westphal.

7) Bump reference counter bump on ct clash resolution only,
   this is safe because bucket lock is held, again from Florian.

8) Use ip_is_fragment() in xt_HMARK, from YueHaibing.

9) Add wildcard support for nft_socket, from Balazs Scheidler.

10) Remove superfluous IPVS dependency on iptables, from
    Yaroslav Bolyukin.

11) Remove unused definition in ebt_stp, from Wang Hai.

12) Replace CONFIG_NFT_CHAIN_NAT_{IPV4,IPV6} by CONFIG_NFT_NAT
    in selftests/net, from Fabian Frederick.

13) Add userdata support for nft_object, from Jose M. Guisado.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-09 11:21:19 -07:00
Jose M. Guisado Gomez b131c96496 netfilter: nf_tables: add userdata support for nft_object
Enables storing userdata for nft_object. Initially this will store an
optional comment but can be extended in the future as needed.

Adds new attribute NFTA_OBJ_USERDATA to nft_object.

Signed-off-by: Jose M. Guisado Gomez <guigom@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-09-08 16:35:38 +02:00
Linus Torvalds 3e8d3bdc2a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Use netif_rx_ni() when necessary in batman-adv stack, from Jussi
    Kivilinna.

 2) Fix loss of RTT samples in rxrpc, from David Howells.

 3) Memory leak in hns_nic_dev_probe(), from Dignhao Liu.

 4) ravb module cannot be unloaded, fix from Yuusuke Ashizuka.

 5) We disable BH for too lokng in sctp_get_port_local(), add a
    cond_resched() here as well, from Xin Long.

 6) Fix memory leak in st95hf_in_send_cmd, from Dinghao Liu.

 7) Out of bound access in bpf_raw_tp_link_fill_link_info(), from
    Yonghong Song.

 8) Missing of_node_put() in mt7530 DSA driver, from Sumera
    Priyadarsini.

 9) Fix crash in bnxt_fw_reset_task(), from Michael Chan.

10) Fix geneve tunnel checksumming bug in hns3, from Yi Li.

11) Memory leak in rxkad_verify_response, from Dinghao Liu.

12) In tipc, don't use smp_processor_id() in preemptible context. From
    Tuong Lien.

13) Fix signedness issue in mlx4 memory allocation, from Shung-Hsi Yu.

14) Missing clk_disable_prepare() in gemini driver, from Dan Carpenter.

15) Fix ABI mismatch between driver and firmware in nfp, from Louis
    Peens.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (110 commits)
  net/smc: fix sock refcounting in case of termination
  net/smc: reset sndbuf_desc if freed
  net/smc: set rx_off for SMCR explicitly
  net/smc: fix toleration of fake add_link messages
  tg3: Fix soft lockup when tg3_reset_task() fails.
  doc: net: dsa: Fix typo in config code sample
  net: dp83867: Fix WoL SecureOn password
  nfp: flower: fix ABI mismatch between driver and firmware
  tipc: fix shutdown() of connectionless socket
  ipv6: Fix sysctl max for fib_multipath_hash_policy
  drivers/net/wan/hdlc: Change the default of hard_header_len to 0
  net: gemini: Fix another missing clk_disable_unprepare() in probe
  net: bcmgenet: fix mask check in bcmgenet_validate_flow()
  amd-xgbe: Add support for new port mode
  net: usb: dm9601: Add USB ID of Keenetic Plus DSL
  vhost: fix typo in error message
  net: ethernet: mlx4: Fix memory allocation in mlx4_buddy_init()
  pktgen: fix error message with wrong function name
  net: ethernet: ti: am65-cpsw: fix rmii 100Mbit link mode
  cxgb4: fix thermal zone device registration
  ...
2020-09-03 18:50:48 -07:00
Yaroslav Bolyukin 144b0a0e60 ipvs: remove dependency on ip6_tables
This dependency was added because ipv6_find_hdr was in iptables specific
code but is no longer required

Fixes: f8f626754e ("ipv6: Move ipv6_find_hdr() out of Netfilter code.")
Fixes: 63dca2c0b0 ("ipvs: Fix faulty IPv6 extension header handling in IPVS")
Signed-off-by: Yaroslav Bolyukin <iam@lach.pw>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-31 23:06:51 +02:00
Balazs Scheidler 67407a406d netfilter: nft_socket: add wildcard support
Add NFT_SOCKET_WILDCARD to match to wildcard socket listener.

Signed-off-by: Balazs Scheidler <bazsi77@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-29 13:04:44 +02:00
Florian Westphal c46172147e netfilter: conntrack: do not auto-delete clash entries on reply
Its possible that we have more than one packet with the same ct tuple
simultaneously, e.g. when an application emits n packets on same UDP
socket from multiple threads.

NAT rules might be applied to those packets. With the right set of rules,
n packets will be mapped to m destinations, where at least two packets end
up with the same destination.

When this happens, the existing clash resolution may merge the skb that
is processed after the first has been received with the identical tuple
already in hash table.

However, its possible that this identical tuple is a NAT_CLASH tuple.
In that case the second skb will be sent, but no reply can be received
since the reply that is processed first removes the NAT_CLASH tuple.

Do not auto-delete, this gives a 1 second window for replies to be passed
back to originator.

Packets that are coming later (udp stream case) will not be affected:
they match the original ct entry, not a NAT_CLASH one.

Also prevent NAT_CLASH entries from getting offloaded.

Fixes: 6a757c07e5 ("netfilter: conntrack: allow insertion of clashing entries")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-29 13:03:06 +02:00
Pablo Neira Ayuso ee92118355 netfilter: nfnetlink: nfnetlink_unicast() reports EAGAIN instead of ENOBUFS
Frontend callback reports EAGAIN to nfnetlink to retry a command, this
is used to signal that module autoloading is required. Unfortunately,
nlmsg_unicast() reports EAGAIN in case the receiver socket buffer gets
full, so it enters a busy-loop.

This patch updates nfnetlink_unicast() to turn EAGAIN into ENOBUFS and
to use nlmsg_unicast(). Remove the flags field in nfnetlink_unicast()
since this is always MSG_DONTWAIT in the existing code which is exactly
what nlmsg_unicast() passes to netlink_unicast() as parameter.

Fixes: 96518518cc ("netfilter: add nftables")
Reported-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-28 20:11:58 +02:00
Randy Dunlap 4b7ddc58e6 netfilter: delete repeated words
Drop duplicated words in net/netfilter/ and net/ipv4/netfilter/.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-28 20:11:38 +02:00
YueHaibing f5143e10a2 netfilter: xt_HMARK: Use ip_is_fragment() helper
Use ip_is_fragment() to simpify code.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-28 19:55:51 +02:00
Florian Westphal ff73e7479b netfilter: conntrack: remove unneeded nf_ct_put
We can delay refcount increment until we reassign the existing entry to
the current skb.

A 0 refcount can't happen while the nf_conn object is still in the
hash table and parallel mutations are impossible because we hold the
bucket lock.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-28 19:51:27 +02:00
Florian Westphal bc92470413 netfilter: conntrack: add clash resolution stat counter
There is a misconception about what "insert_failed" means.

We increment this even when a clash got resolved, so it might not indicate
a problem.

Add a dedicated counter for clash resolution and only increment
insert_failed if a clash cannot be resolved.

For the old /proc interface, export this in place of an older stat
that got removed a while back.
For ctnetlink, export this with a new attribute.

Also correct an outdated comment that implies we add a duplicate tuple --
we only add the (unique) reply direction.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-28 19:51:26 +02:00
Florian Westphal 4afc41dfa5 netfilter: conntrack: remove ignore stats
This counter increments when nf_conntrack_in sees a packet that already
has a conntrack attached or when the packet is marked as UNTRACKED.
Neither is an error.

The former is normal for loopback traffic.  The second happens for
certain ICMPv6 packets or when nftables/ip(6)tables rules are in place.

In case someone needs to count UNTRACKED packets, or packets
that are marked as untracked before conntrack_in this can be done with
both nftables and ip(6)tables rules.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-28 19:51:26 +02:00
Florian Westphal b1328e54ac netfilter: conntrack: do not increment two error counters at same time
The /proc interface for nf_conntrack displays the "error" counter as
"icmp_error".

It makes sense to not increment "invalid" when failing to handle an icmp
packet since those are special.

For example, its possible for conntrack to see partial and/or fragmented
packets inside icmp errors.  This should be a separate event and not get
mixed with the "invalid" counter.

Likewise, remove the "error" increment for errors from get_l4proto().
After this, the error counter will only increment for errors coming from
icmp(v6) packet handling.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-28 19:51:25 +02:00
Jose M. Guisado Gomez 7a81575b80 netfilter: nf_tables: add userdata attributes to nft_table
Enables storing userdata for nft_table. Field udata points to user data
and udlen store its length.

Adds new attribute flag NFTA_TABLE_USERDATA

Signed-off-by: Jose M. Guisado Gomez <guigom@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-28 19:18:48 +02:00
Peilin Ye c5a8a8498e ipvs: Fix uninit-value in do_ip_vs_set_ctl()
do_ip_vs_set_ctl() is referencing uninitialized stack value when `len` is
zero. Fix it.

Reported-by: syzbot+23b5f9e7caf61d9a3898@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=46ebfb92a8a812621a001ef04d90dfa459520fe2
Suggested-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Peilin Ye <yepeilin.cs@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Reviewed-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-28 19:18:48 +02:00
Gustavo A. R. Silva df561f6688 treewide: Use fallthrough pseudo-keyword
Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-23 17:36:59 -05:00
Florian Westphal 1e105e6afa netfilter: nf_tables: fix destination register zeroing
Following bug was reported via irc:
nft list ruleset
   set knock_candidates_ipv4 {
      type ipv4_addr . inet_service
      size 65535
      elements = { 127.0.0.1 . 123,
                   127.0.0.1 . 123 }
      }
 ..
   udp dport 123 add @knock_candidates_ipv4 { ip saddr . 123 }
   udp dport 123 add @knock_candidates_ipv4 { ip saddr . udp dport }

It should not have been possible to add a duplicate set entry.

After some debugging it turned out that the problem is the immediate
value (123) in the second-to-last rule.

Concatenations use 32bit registers, i.e. the elements are 8 bytes each,
not 6 and it turns out the kernel inserted

inet firewall @knock_candidates_ipv4
        element 0100007f ffff7b00  : 0 [end]
        element 0100007f 00007b00  : 0 [end]

Note the non-zero upper bits of the first element.  It turns out that
nft_immediate doesn't zero the destination register, but this is needed
when the length isn't a multiple of 4.

Furthermore, the zeroing in nft_payload is broken.  We can't use
[len / 4] = 0 -- if len is a multiple of 4, index is off by one.

Skip zeroing in this case and use a conditional instead of (len -1) / 4.

Fixes: 49499c3e6e ("netfilter: nf_tables: switch registers to 32 bit addressing")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-21 19:00:33 +02:00
Pablo Neira Ayuso 6f03bf43ee netfilter: nf_tables: add NFTA_SET_USERDATA if not null
Kernel sends an empty NFTA_SET_USERDATA attribute with no value if
userspace adds a set with no NFTA_SET_USERDATA attribute.

Fixes: e6d8ecac9e ("netfilter: nf_tables: Add new attributes into nft_set to store user data.")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-21 19:00:14 +02:00
Stefano Brivio 0726763043 netfilter: nft_set_rbtree: Detect partial overlap with start endpoint match
Getting creative with nft and omitting the interval_overlap()
check from the set_overlap() function, without omitting
set_overlap() altogether, led to the observation of a partial
overlap that wasn't detected, and would actually result in
replacement of the end element of an existing interval.

This is due to the fact that we'll return -EEXIST on a matching,
pre-existing start element, instead of -ENOTEMPTY, and the error
is cleared by API if NLM_F_EXCL is not given. At this point, we
can insert a matching start, and duplicate the end element as long
as we don't end up into other intervals.

For instance, inserting interval 0 - 2 with an existing 0 - 3
interval would result in a single 0 - 2 interval, and a dangling
'3' end element. This is because nft will proceed after inserting
the '0' start element as no error is reported, and no further
conflicting intervals are detected on insertion of the end element.

This needs a different approach as it's a local condition that can
be detected by looking for duplicate ends coming from left and
right, separately. Track those and directly report -ENOTEMPTY on
duplicated end elements for a matching start.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-21 17:37:36 +02:00
Stefano Brivio 226a88de47 netfilter: nft_set_rbtree: Handle outcomes of tree rotations in overlap detection
Checks for partial overlaps on insertion assume that end elements
are always descendant nodes of their corresponding start, because
they are inserted later. However, this is not the case if a
previous delete operation caused a tree rotation as part of
rebalancing.

Taking the issue reported by Andreas Fischer as an example, if we
omit delete operations, the existing procedure works because,
equivalently, we are inserting a start item with value 40 in the
this region of the red-black tree with single-sized intervals:

                                  overlap flag
                   10 (start)
                  /  \            false
                      20 (start)
                     /  \         false
                         30 (start)
                        /  \      false
                            60 (start)
                           /  \   false
                         50 (end)
                        /  \      false
                      20 (end)
                     /  \         false
                         40 (start)

if we now delete interval 30 - 30, the tree can be rearranged in
a way similar to this (note the rotation involving 50 - 50):

                                  overlap flag
                   10 (start)
                  /  \            false
                      20 (start)
                     /  \         false
                         25 (start)
                        /  \      false
                            70 (start)
                           /  \   false
                         50 (end)
                        /  \      true (from rule a1.)
                      50 (start)
                     /  \         true
                   40 (start)

and we traverse interval 50 - 50 from the opposite direction
compared to what was expected.

To deal with those cases, add a start-before-start rule, b4.,
that covers traversal of existing intervals from the right.

We now need to restrict start-after-end rule b3. to cases
where there are no occurring nodes between existing start and
end elements, because addition of rule b4. isn't sufficient to
ensure that the pre-existing end element we encounter while
descending the tree corresponds to a start element of an
interval that we already traversed entirely.

Different types of overlap detection on trees with rotations
resulting from re-balancing will be covered by nft test case
sets/0044interval_overlap_1.

Reported-by: Andreas Fischer <netfilter@d9c.eu>
Bugzilla: https://bugzilla.netfilter.org/show_bug.cgi?id=1449
Cc: <stable@vger.kernel.org> # 5.6.x
Fixes: 7c84d41416 ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-21 17:36:52 +02:00
Florian Westphal cc5453a5b7 netfilter: conntrack: allow sctp hearbeat after connection re-use
If an sctp connection gets re-used, heartbeats are flagged as invalid
because their vtag doesn't match.

Handle this in a similar way as TCP conntrack when it suspects that the
endpoints and conntrack are out-of-sync.

When a HEARTBEAT request fails its vtag validation, flag this in the
conntrack state and accept the packet.

When a HEARTBEAT_ACK is received with an invalid vtag in the reverse
direction after we allowed such a HEARTBEAT through, assume we are
out-of-sync and re-set the vtag info.

v2: remove left-over snippet from an older incarnation that moved
    new_state/old_state assignments, thats not needed so keep that
    as-is.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-20 14:13:49 +02:00
David S. Miller 8c26544f5a Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Endianness issue in IPv4 option support in nft_exthdr,
   from Stephen Suryaputra.

2) Removes the waitcount optimization in nft_compat,
   from Florian Westphal.

3) Remove ipv6 -> nf_defrag_ipv6 module dependency, from
   Florian Westphal.

4) Memleak in chain binding support, also from Florian.

5) Simplify nft_flowtable.sh selftest, from Fabian Frederick.

6) Optional MTU arguments for selftest nft_flowtable.sh,
   also from Fabian.

7) Remove noise error report when killing process in
   selftest nft_flowtable.sh, from Fabian Frederick.

8) Reject bogus getsockopt option length in ebtables,
   from Florian Westphal.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-16 16:05:36 -07:00
Florian Westphal 59136aa3b2 netfilter: nf_tables: free chain context when BINDING flag is missing
syzbot found a memory leak in nf_tables_addchain() because the chain
object is not free'd correctly on error.

Fixes: d0e2c7de92 ("netfilter: nf_tables: add NFT_CHAIN_BINDING")
Reported-by: syzbot+c99868fde67014f7e9f5@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-13 04:17:46 +02:00
Linus Torvalds 97d052ea3f A set of locking fixes and updates:
- Untangle the header spaghetti which causes build failures in various
     situations caused by the lockdep additions to seqcount to validate that
     the write side critical sections are non-preemptible.
 
   - The seqcount associated lock debug addons which were blocked by the
     above fallout.
 
     seqcount writers contrary to seqlock writers must be externally
     serialized, which usually happens via locking - except for strict per
     CPU seqcounts. As the lock is not part of the seqcount, lockdep cannot
     validate that the lock is held.
 
     This new debug mechanism adds the concept of associated locks.
     sequence count has now lock type variants and corresponding
     initializers which take a pointer to the associated lock used for
     writer serialization. If lockdep is enabled the pointer is stored and
     write_seqcount_begin() has a lockdep assertion to validate that the
     lock is held.
 
     Aside of the type and the initializer no other code changes are
     required at the seqcount usage sites. The rest of the seqcount API is
     unchanged and determines the type at compile time with the help of
     _Generic which is possible now that the minimal GCC version has been
     moved up.
 
     Adding this lockdep coverage unearthed a handful of seqcount bugs which
     have been addressed already independent of this.
 
     While generaly useful this comes with a Trojan Horse twist: On RT
     kernels the write side critical section can become preemtible if the
     writers are serialized by an associated lock, which leads to the well
     known reader preempts writer livelock. RT prevents this by storing the
     associated lock pointer independent of lockdep in the seqcount and
     changing the reader side to block on the lock when a reader detects
     that a writer is in the write side critical section.
 
  - Conversion of seqcount usage sites to associated types and initializers.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl8xmPYTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoTuQEACyzQCjU8PgehPp9oMqWzaX2fcVyuZO
 QU2yw6gmz2oTz3ZHUNwdW8UnzGh2OWosK3kDruoD9FtSS51lER1/ISfSPCGfyqxC
 KTjOcB1Kvxwq/3LcCx7Zi3ZxWApat74qs3EhYhKtEiQ2Y9xv9rLq8VV1UWAwyxq0
 eHpjlIJ6b6rbt+ARslaB7drnccOsdK+W/roNj4kfyt+gezjBfojGRdMGQNMFcpnv
 shuTC+vYurAVIiVA/0IuizgHfwZiXOtVpjVoEWaxg6bBH6HNuYMYzdSa/YrlDkZs
 n/aBI/Xkvx+Eacu8b1Zwmbzs5EnikUK/2dMqbzXKUZK61eV4hX5c2xrnr1yGWKTs
 F/juh69Squ7X6VZyKVgJ9RIccVueqwR2EprXWgH3+RMice5kjnXH4zURp0GHALxa
 DFPfB6fawcH3Ps87kcRFvjgm6FBo0hJ1AxmsW1dY4ACFB9azFa2euW+AARDzHOy2
 VRsUdhL9CGwtPjXcZ/9Rhej6fZLGBXKr8uq5QiMuvttp4b6+j9FEfBgD4S6h8csl
 AT2c2I9LcbWqyUM9P4S7zY/YgOZw88vHRuDH7tEBdIeoiHfrbSBU7EQ9jlAKq/59
 f+Htu2Io281c005g7DEeuCYvpzSYnJnAitj5Lmp/kzk2Wn3utY1uIAVszqwf95Ul
 81ppn2KlvzUK8g==
 =7Gj+
 -----END PGP SIGNATURE-----

Merge tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Thomas Gleixner:
 "A set of locking fixes and updates:

   - Untangle the header spaghetti which causes build failures in
     various situations caused by the lockdep additions to seqcount to
     validate that the write side critical sections are non-preemptible.

   - The seqcount associated lock debug addons which were blocked by the
     above fallout.

     seqcount writers contrary to seqlock writers must be externally
     serialized, which usually happens via locking - except for strict
     per CPU seqcounts. As the lock is not part of the seqcount, lockdep
     cannot validate that the lock is held.

     This new debug mechanism adds the concept of associated locks.
     sequence count has now lock type variants and corresponding
     initializers which take a pointer to the associated lock used for
     writer serialization. If lockdep is enabled the pointer is stored
     and write_seqcount_begin() has a lockdep assertion to validate that
     the lock is held.

     Aside of the type and the initializer no other code changes are
     required at the seqcount usage sites. The rest of the seqcount API
     is unchanged and determines the type at compile time with the help
     of _Generic which is possible now that the minimal GCC version has
     been moved up.

     Adding this lockdep coverage unearthed a handful of seqcount bugs
     which have been addressed already independent of this.

     While generally useful this comes with a Trojan Horse twist: On RT
     kernels the write side critical section can become preemtible if
     the writers are serialized by an associated lock, which leads to
     the well known reader preempts writer livelock. RT prevents this by
     storing the associated lock pointer independent of lockdep in the
     seqcount and changing the reader side to block on the lock when a
     reader detects that a writer is in the write side critical section.

   - Conversion of seqcount usage sites to associated types and
     initializers"

* tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
  locking/seqlock, headers: Untangle the spaghetti monster
  locking, arch/ia64: Reduce <asm/smp.h> header dependencies by moving XTP bits into the new <asm/xtp.h> header
  x86/headers: Remove APIC headers from <asm/smp.h>
  seqcount: More consistent seqprop names
  seqcount: Compress SEQCNT_LOCKNAME_ZERO()
  seqlock: Fold seqcount_LOCKNAME_init() definition
  seqlock: Fold seqcount_LOCKNAME_t definition
  seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g
  hrtimer: Use sequence counter with associated raw spinlock
  kvm/eventfd: Use sequence counter with associated spinlock
  userfaultfd: Use sequence counter with associated spinlock
  NFSv4: Use sequence counter with associated spinlock
  iocost: Use sequence counter with associated spinlock
  raid5: Use sequence counter with associated spinlock
  vfs: Use sequence counter with associated spinlock
  timekeeping: Use sequence counter with associated raw spinlock
  xfrm: policy: Use sequence counters with associated lock
  netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
  netfilter: conntrack: Use sequence counter with associated spinlock
  sched: tasks: Use sequence counter with associated spinlock
  ...
2020-08-10 19:07:44 -07:00
Florian Westphal 2f941622fd netfilter: nft_compat: remove flush counter optimization
WARNING: CPU: 1 PID: 16059 at lib/refcount.c:31 refcount_warn_saturate+0xdf/0xf
[..]
 __nft_mt_tg_destroy+0x42/0x50 [nft_compat]
 nft_target_destroy+0x63/0x80 [nft_compat]
 nf_tables_expr_destroy+0x1b/0x30 [nf_tables]
 nf_tables_rule_destroy+0x3a/0x70 [nf_tables]
 nf_tables_exit_net+0x186/0x3d0 [nf_tables]

Happens when a compat expr is destoyed from abort path.
There is no functional impact; after this work queue is flushed
unconditionally if its pending.

This removes the waitcount optimization.  Test of repeated
iptables-restore of a ~60k kubernetes ruleset doesn't indicate
a slowdown.  In case the counter is needed after all for some workloads
we can revert this and increment the refcount for the
!= NFT_PREPARE_TRANS case to avoid the increment/decrement imbalance.

While at it, also flush for match case, this was an oversight
in the original patch.

Fixes: ffe8923f10 ("netfilter: nft_compat: make sure xtables destructors have run")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-10 13:03:36 +02:00
Stephen Suryaputra b428336676 netfilter: nf_tables: nft_exthdr: the presence return value should be little-endian
On big-endian machine, the returned register data when the exthdr is
present is not being compared correctly because little-endian is
assumed. The function nft_cmp_fast_mask(), called by nft_cmp_fast_eval()
and nft_cmp_fast_init(), calls cpu_to_le32().

The following dump also shows that little endian is assumed:

$ nft --debug=netlink add rule ip recordroute forward ip option rr exists counter
ip
  [ exthdr load ipv4 1b @ 7 + 0 present => reg 1 ]
  [ cmp eq reg 1 0x01000000 ]
  [ counter pkts 0 bytes 0 ]

Lastly, debug print in nft_cmp_fast_init() and nft_cmp_fast_eval() when
RR option exists in the packet shows that the comparison fails because
the assumption:

nft_cmp_fast_init:189 priv->sreg=4 desc.len=8 mask=0xff000000 data.data[0]=0x10003e0
nft_cmp_fast_eval:57 regs->data[priv->sreg=4]=0x1 mask=0xff000000 priv->data=0x1000000

v2: use nft_reg_store8() instead (Florian Westphal). Also to avoid the
    warnings reported by kernel test robot.

Fixes: dbb5281a1f ("netfilter: nf_tables: add support for matching IPv4 options")
Fixes: c078ca3b0c ("netfilter: nft_exthdr: Add support for existence check")
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-10 13:02:43 +02:00
Linus Torvalds 47ec5303d7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:

 1) Support 6Ghz band in ath11k driver, from Rajkumar Manoharan.

 2) Support UDP segmentation in code TSO code, from Eric Dumazet.

 3) Allow flashing different flash images in cxgb4 driver, from Vishal
    Kulkarni.

 4) Add drop frames counter and flow status to tc flower offloading,
    from Po Liu.

 5) Support n-tuple filters in cxgb4, from Vishal Kulkarni.

 6) Various new indirect call avoidance, from Eric Dumazet and Brian
    Vazquez.

 7) Fix BPF verifier failures on 32-bit pointer arithmetic, from
    Yonghong Song.

 8) Support querying and setting hardware address of a port function via
    devlink, use this in mlx5, from Parav Pandit.

 9) Support hw ipsec offload on bonding slaves, from Jarod Wilson.

10) Switch qca8k driver over to phylink, from Jonathan McDowell.

11) In bpftool, show list of processes holding BPF FD references to
    maps, programs, links, and btf objects. From Andrii Nakryiko.

12) Several conversions over to generic power management, from Vaibhav
    Gupta.

13) Add support for SO_KEEPALIVE et al. to bpf_setsockopt(), from Dmitry
    Yakunin.

14) Various https url conversions, from Alexander A. Klimov.

15) Timestamping and PHC support for mscc PHY driver, from Antoine
    Tenart.

16) Support bpf iterating over tcp and udp sockets, from Yonghong Song.

17) Support 5GBASE-T i40e NICs, from Aleksandr Loktionov.

18) Add kTLS RX HW offload support to mlx5e, from Tariq Toukan.

19) Fix the ->ndo_start_xmit() return type to be netdev_tx_t in several
    drivers. From Luc Van Oostenryck.

20) XDP support for xen-netfront, from Denis Kirjanov.

21) Support receive buffer autotuning in MPTCP, from Florian Westphal.

22) Support EF100 chip in sfc driver, from Edward Cree.

23) Add XDP support to mvpp2 driver, from Matteo Croce.

24) Support MPTCP in sock_diag, from Paolo Abeni.

25) Commonize UDP tunnel offloading code by creating udp_tunnel_nic
    infrastructure, from Jakub Kicinski.

26) Several pci_ --> dma_ API conversions, from Christophe JAILLET.

27) Add FLOW_ACTION_POLICE support to mlxsw, from Ido Schimmel.

28) Add SK_LOOKUP bpf program type, from Jakub Sitnicki.

29) Refactor a lot of networking socket option handling code in order to
    avoid set_fs() calls, from Christoph Hellwig.

30) Add rfc4884 support to icmp code, from Willem de Bruijn.

31) Support TBF offload in dpaa2-eth driver, from Ioana Ciornei.

32) Support XDP_REDIRECT in qede driver, from Alexander Lobakin.

33) Support PCI relaxed ordering in mlx5 driver, from Aya Levin.

34) Support TCP syncookies in MPTCP, from Flowian Westphal.

35) Fix several tricky cases of PMTU handling wrt. briding, from Stefano
    Brivio.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2056 commits)
  net: thunderx: initialize VF's mailbox mutex before first usage
  usb: hso: remove bogus check for EINPROGRESS
  usb: hso: no complaint about kmalloc failure
  hso: fix bailout in error case of probe
  ip_tunnel_core: Fix build for archs without _HAVE_ARCH_IPV6_CSUM
  selftests/net: relax cpu affinity requirement in msg_zerocopy test
  mptcp: be careful on subflow creation
  selftests: rtnetlink: make kci_test_encap() return sub-test result
  selftests: rtnetlink: correct the final return value for the test
  net: dsa: sja1105: use detected device id instead of DT one on mismatch
  tipc: set ub->ifindex for local ipv6 address
  ipv6: add ipv6_dev_find()
  net: openvswitch: silence suspicious RCU usage warning
  Revert "vxlan: fix tos value before xmit"
  ptp: only allow phase values lower than 1 period
  farsync: switch from 'pci_' to 'dma_' API
  wan: wanxl: switch from 'pci_' to 'dma_' API
  hv_netvsc: do not use VF device if link is down
  dpaa2-eth: Fix passing zero to 'PTR_ERR' warning
  net: macb: Properly handle phylink on at91sam9x
  ...
2020-08-05 20:13:21 -07:00
Linus Torvalds fd76a74d94 audit/stable-5.9 PR 20200803
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAl8okpIUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNqOQ/8D+m9Ykcby3csEKsp8YtsaukEu62U
 lRVaxzRNO9wwB24aFwDFuJnIkmsSi/s/O4nBsy2mw+Apn+uDCvHQ9tBU07vlNn2f
 lu27YaTya7YGlqoe315xijd8tyoX99k8cpQeixvAVr9/jdR09yka7SJ8O7X9mjV7
 +SUVDiKCplPKpiwCCRS9cqD7F64T6y35XKzbrzYqdP0UOF2XelZo/Evt5rDRvWUf
 5qDN2tP+iM/Fvu5lCfczFwAeivfAdxjQ11n783hx8Ms2qyiaKQCzbEwjqAslmkbs
 1k/+ED0NjzXX1ne0JZaz/bk0wsMnmOoa8o+NDcyd7Za/cj5prUZi7kBy+xry4YV8
 qKJ40Lk0flCWgUpm6bkYVOByIYHk0gmfBNvjilqf25NR/eOC/9e9ir8PywvYUW/7
 kvVK37+N/a3LnFj80sZpIeqqnNU8z9PV1i7//5/kDuKvz94Bq83TJDO6pPKvqDtC
 njQfCFoHwdEeF8OalK793lIiYaoODqvbkWKChKMqziODJ4ZP8AW06gXpEbEWn7G3
 TTnJx7hqzR9t90vBQJeO3Fromfn+9TDlZVdX+EGO8gIqUiLGr0r7LPPep4VkDbNw
 LxMYKeC2cgRp8Z+XXPDxfXSDL2psTwg6CXcDrXcYnUyBo/yerpBvbJkeaR0h+UR0
 j6cvMX+T39X2JXM=
 =Xs3M
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20200803' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit updates from Paul Moore:
 "Aside from some smaller bug fixes, here are the highlights:

   - add a new backlog wait metric to the audit status message, this is
     intended to help admins determine how long processes have been
     waiting for the audit backlog queue to clear

   - generate audit records for nftables configuration changes

   - generate CWD audit records for for the relevant LSM audit records"

* tag 'audit-pr-20200803' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: report audit wait metric in audit status reply
  audit: purge audit_log_string from the intra-kernel audit API
  audit: issue CWD record to accompany LSM_AUDIT_DATA_* records
  audit: use the proper gfp flags in the audit_log_nfcfg() calls
  audit: remove unused !CONFIG_AUDITSYSCALL __audit_inode* stubs
  audit: add gfp parameter to audit_log_nfcfg
  audit: log nftables configuration change events
  audit: Use struct_size() helper in alloc_chunk
2020-08-04 14:20:26 -07:00
Linus Torvalds 99ea1521a0 Remove uninitialized_var() macro for v5.9-rc1
- Clean up non-trivial uses of uninitialized_var()
 - Update documentation and checkpatch for uninitialized_var() removal
 - Treewide removal of uninitialized_var()
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl8oYLQWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJsfjEACvf0D3WL3H7sLHtZ2HeMwOgAzq
 il08t6vUscINQwiIIK3Be43ok3uQ1Q+bj8sr2gSYTwunV2IYHFferzgzhyMMno3o
 XBIGd1E+v1E4DGBOiRXJvacBivKrfvrdZ7AWiGlVBKfg2E0fL1aQbe9AYJ6eJSbp
 UGqkBkE207dugS5SQcwrlk1tWKUL089lhDAPd7iy/5RK76OsLRCJFzIerLHF2ZK2
 BwvA+NWXVQI6pNZ0aRtEtbbxwEU4X+2J/uaXH5kJDszMwRrgBT2qoedVu5LXFPi8
 +B84IzM2lii1HAFbrFlRyL/EMueVFzieN40EOB6O8wt60Y4iCy5wOUzAdZwFuSTI
 h0xT3JI8BWtpB3W+ryas9cl9GoOHHtPA8dShuV+Y+Q2bWe1Fs6kTl2Z4m4zKq56z
 63wQCdveFOkqiCLZb8s6FhnS11wKtAX4czvXRXaUPgdVQS1Ibyba851CRHIEY+9I
 AbtogoPN8FXzLsJn7pIxHR4ADz+eZ0dQ18f2hhQpP6/co65bYizNP5H3h+t9hGHG
 k3r2k8T+jpFPaddpZMvRvIVD8O2HvJZQTyY6Vvneuv6pnQWtr2DqPFn2YooRnzoa
 dbBMtpon+vYz6OWokC5QNWLqHWqvY9TmMfcVFUXE4AFse8vh4wJ8jJCNOFVp8On+
 drhmmImUr1YylrtVOw==
 =xHmk
 -----END PGP SIGNATURE-----

Merge tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull uninitialized_var() macro removal from Kees Cook:
 "This is long overdue, and has hidden too many bugs over the years. The
  series has several "by hand" fixes, and then a trivial treewide
  replacement.

   - Clean up non-trivial uses of uninitialized_var()

   - Update documentation and checkpatch for uninitialized_var() removal

   - Treewide removal of uninitialized_var()"

* tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  compiler: Remove uninitialized_var() macro
  treewide: Remove uninitialized_var() usage
  checkpatch: Remove awareness of uninitialized_var() macro
  mm/debug_vm_pgtable: Remove uninitialized_var() usage
  f2fs: Eliminate usage of uninitialized_var() macro
  media: sur40: Remove uninitialized_var() usage
  KVM: PPC: Book3S PR: Remove uninitialized_var() usage
  clk: spear: Remove uninitialized_var() usage
  clk: st: Remove uninitialized_var() usage
  spi: davinci: Remove uninitialized_var() usage
  ide: Remove uninitialized_var() usage
  rtlwifi: rtl8192cu: Remove uninitialized_var() usage
  b43: Remove uninitialized_var() usage
  drbd: Remove uninitialized_var() usage
  x86/mm/numa: Remove uninitialized_var() usage
  docs: deprecated.rst: Add uninitialized_var()
2020-08-04 13:49:43 -07:00
David S. Miller ee895a30ef Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Flush the cleanup xtables worker to make sure destructors
   have completed, from Florian Westphal.

2) iifgroup is matching erroneously, also from Florian.

3) Add selftest for meta interface matching, from Florian Westphal.

4) Move nf_ct_offload_timeout() to header, from Roi Dayan.

5) Call nf_ct_offload_timeout() from flow_offload_add() to
   make sure garbage collection does not evict offloaded flow,
   from Roi Dayan.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-04 13:32:39 -07:00
David S. Miller f2e0b29a9a Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

1) UAF in chain binding support from previous batch, from Dan Carpenter.

2) Queue up delayed work to expire connections with no destination,
   from Andrew Sy Kim.

3) Use fallthrough pseudo-keyword, from Gustavo A. R. Silva.

4) Replace HTTP links with HTTPS, from Alexander A. Klimov.

5) Remove superfluous null header checks in ip6tables, from
   Gaurav Singh.

6) Add extended netlink error reporting for expression.

7) Report EEXIST on overlapping chain, set elements and flowtable
   devices.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-03 16:03:18 -07:00
Roi Dayan 4203b19c27 netfilter: flowtable: Set offload timeout when adding flow
On heavily loaded systems the GC can take time to go over all existing
conns and reset their timeout. At that time other calls like from
nf_conntrack_in() can call of nf_ct_is_expired() and see the conn as
expired. To fix this when we set the offload bit we should also reset
the timeout instead of counting on GC to finish first iteration over
all conns before the initial timeout.

Fixes: 90964016e5 ("netfilter: nf_conntrack: add IPS_OFFLOAD status bit")
Signed-off-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-03 12:37:24 +02:00
Roi Dayan 73f9407b3e netfilter: conntrack: Move nf_ct_offload_timeout to header file
To be used by callers from other modules.

[ Rename DAY to NF_CT_DAY to avoid possible symbol name pollution
  issue --Pablo ]

Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-03 12:36:47 +02:00
Pablo Neira Ayuso 77a92189ec netfilter: nf_tables: report EEXIST on overlaps
Replace EBUSY by EEXIST in the following cases:

- If the user adds a chain with a different configuration such as different
  type, hook and priority.

- If the user adds a non-base chain that clashes with an existing basechain.

- If the user adds a { key : value } mapping element and the key exists
  but the value differs.

- If the device already belongs to an existing flowtable.

User describe that this error reporting is confusing:

- https://bugzilla.netfilter.org/show_bug.cgi?id=1176
- https://bugzilla.netfilter.org/show_bug.cgi?id=1413

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-02 19:53:45 +02:00
Florian Westphal 78470d9d0d netfilter: nft_meta: fix iifgroup matching
iifgroup matching erroneously checks the output interface.

Fixes: 8724e819cc ("netfilter: nft_meta: move all interface related keys to helper")
Reported-by: Demi M. Obenour <demiobenour@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-02 04:10:20 +02:00
Pablo Neira Ayuso 83d9dcba06 netfilter: nf_tables: extended netlink error reporting for expressions
This patch extends 36dd1bcc07 ("netfilter: nf_tables: initial support
for extended ACK reporting") to include netlink extended error reporting
for expressions. This allows userspace to identify what rule expression
is triggering the error.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-02 03:23:20 +02:00
Florian Westphal ffe8923f10 netfilter: nft_compat: make sure xtables destructors have run
Pablo Neira found that after recent update of xt_IDLETIMER the
iptables-nft tests sometimes show an error.

He tracked this down to the delayed cleanup used by nf_tables core:
del rule (transaction A)
add rule (transaction B)

Its possible that by time transaction B (both in same netns) runs,
the xt target destructor has not been invoked yet.

For native nft expressions this is no problem because all expressions
that have such side effects make sure these are handled from the commit
phase, rather than async cleanup.

For nft_compat however this isn't true.

Instead of forcing synchronous behaviour for nft_compat, keep track
of the number of outstanding destructor calls.

When we attempt to create a new expression, flush the cleanup worker
to make sure destructors have completed.

With lots of help from Pablo Neira.

Reported-by: Pablo Neira Ayso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-31 19:28:00 +02:00
Alexander A. Klimov 50935339c3 netfilter: Replace HTTP links with HTTPS ones
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.

Deterministic algorithm:
For each file:
  If not .svg:
    For each line:
      If doesn't contain `\bxmlns\b`:
        For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
	  If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
            If both the HTTP and HTTPS versions
            return 200 OK and serve the same content:
              Replace HTTP with HTTPS.

Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-29 20:09:18 +02:00
Ahmed S. Darwish b901892b51 netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
A sequence counter write side critical section must be protected by some
form of locking to serialize writers. A plain seqcount_t does not
contain the information of which lock must be held when entering a write
side critical section.

Use the new seqcount_rwlock_t data type, which allows to associate a
rwlock with the sequence counter. This enables lockdep to verify that
the rwlock used for writer serialization is held when the write side
critical section is entered.

If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200720155530.1173732-16-a.darwish@linutronix.de
2020-07-29 16:14:26 +02:00
Ahmed S. Darwish 8201d923f4 netfilter: conntrack: Use sequence counter with associated spinlock
A sequence counter write side critical section must be protected by some
form of locking to serialize writers. A plain seqcount_t does not
contain the information of which lock must be held when entering a write
side critical section.

Use the new seqcount_spinlock_t data type, which allows to associate a
spinlock with the sequence counter. This enables lockdep to verify that
the spinlock used for writer serialization is held when the write side
critical section is entered.

If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200720155530.1173732-15-a.darwish@linutronix.de
2020-07-29 16:14:26 +02:00
Christoph Hellwig d3c4815151 net: remove sockptr_advance
sockptr_advance never properly worked.  Replace it with _offset variants
of copy_from_sockptr and copy_to_sockptr.

Fixes: ba423fdaa5 ("net: add a new sockptr_t type")
Reported-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reported-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jason A. Donenfeld <Jason@zx2c4.com>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-28 13:43:40 -07:00
David S. Miller a57066b1a0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
The UDP reuseport conflict was a little bit tricky.

The net-next code, via bpf-next, extracted the reuseport handling
into a helper so that the BPF sk lookup code could invoke it.

At the same time, the logic for reuseport handling of unconnected
sockets changed via commit efc6b6f6c3
which changed the logic to carry on the reuseport result into the
rest of the lookup loop if we do not return immediately.

This requires moving the reuseport_has_conns() logic into the callers.

While we are here, get rid of inline directives as they do not belong
in foo.c files.

The other changes were cases of more straightforward overlapping
modifications.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-25 17:49:04 -07:00
Christoph Hellwig c2f12630c6 netfilter: switch nf_setsockopt to sockptr_t
Pass a sockptr_t to prepare for set_fs-less handling of the kernel
pointer from bpf-cgroup.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-24 15:41:54 -07:00
Christoph Hellwig ab214d1bf8 netfilter: switch xt_copy_counters to sockptr_t
Pass a sockptr_t to prepare for set_fs-less handling of the kernel
pointer from bpf-cgroup.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-24 15:41:53 -07:00
guodeqing 8210e344cc ipvs: fix the connection sync failed in some cases
The sync_thread_backup only checks sk_receive_queue is empty or not,
there is a situation which cannot sync the connection entries when
sk_receive_queue is empty and sk_rmem_alloc is larger than sk_rcvbuf,
the sync packets are dropped in __udp_enqueue_schedule_skb, this is
because the packets in reader_queue is not read, so the rmem is
not reclaimed.

Here I add the check of whether the reader_queue of the udp sock is
empty or not to solve this problem.

Fixes: 2276f58ac5 ("udp: use a separate rx queue for packet reception")
Reported-by: zhouxudong <zhouxudong8@huawei.com>
Signed-off-by: guodeqing <geffrey.guo@huawei.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-22 01:21:34 +02:00
Gustavo A. R. Silva 954d82979b netfilter: Use fallthrough pseudo-keyword
Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-22 01:18:05 +02:00
Andrew Sy Kim 35dfb01314 ipvs: queue delayed work to expire no destination connections if expire_nodest_conn=1
When expire_nodest_conn=1 and a destination is deleted, IPVS does not
expire the existing connections until the next matching incoming packet.
If there are many connection entries from a single client to a single
destination, many packets may get dropped before all the connections are
expired (more likely with lots of UDP traffic). An optimization can be
made where upon deletion of a destination, IPVS queues up delayed work
to immediately expire any connections with a deleted destination. This
ensures any reused source ports from a client (within the IPVS timeouts)
are scheduled to new real servers instead of silently dropped.

Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-22 01:17:59 +02:00
Christoph Hellwig 657e4c34a2 netfilter: split nf_sockopt
Split nf_sockopt into a getsockopt and setsockopt side as they share
very little code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-19 18:16:40 -07:00
Christoph Hellwig c34bc10d25 netfilter: remove the compat argument to xt_copy_counters_from_user
Lift the in_compat_syscall() from the callers instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-19 18:16:40 -07:00
Christoph Hellwig 77d4df41d5 netfilter: remove the compat_{get,set} methods
All instances handle compat sockopts via in_compat_syscall() now, so
remove the compat_{get,set} methods as well as the
compat_nf_{get,set}sockopt wrappers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-19 18:16:40 -07:00