WSL2-Linux-Kernel

Граф коммитов

Автор	SHA1	Сообщение	Дата
Arnd Bergmann	0ef049dc11	mac80211: avoid excessive stack usage in sta_info When CONFIG_OPTIMIZE_INLINING is set, the sta_info_insert_finish function consumes more stack than normally, exceeding the 1024 byte limit on ARM: net/mac80211/sta_info.c: In function 'sta_info_insert_finish': net/mac80211/sta_info.c:561:1: error: the frame size of 1080 bytes is larger than 1024 bytes [-Werror=frame-larger-than=] It turns out that there are two functions that put a 'struct station_info' on the stack: __sta_info_destroy_part2 and sta_info_insert_finish, and this structure alone requires up to 792 bytes. Hoping that both are called rarely enough, this replaces the on-stack structure with a dynamic allocation, which unfortunately requires some suboptimal error handling for out-of-memory. The __sta_info_destroy_part2 function is actually affected by the stack usage twice because it calls cfg80211_del_sta_sinfo(), which has another instance of struct station_info on its stack. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: `98b6218388` ("mac80211/cfg80211: add station events") Fixes: `6f7a8d26e2` ("mac80211: send statistics with delete station event") Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-02-24 09:04:28 +01:00
Johannes Berg	89f774e6c4	mac80211: always print a message when disconnecting Make sure there's at least a debug message whenever the connection to the AP is terminated. Also change one message from wiphy_debug() to the common mlme_dbg(). Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-02-24 09:04:28 +01:00
Sara Sharon	d321cd014e	mac80211: fix ibss scan parameters When joining IBSS a full scan should be initiated in order to search for existing cell, unless the fixed_channel parameter was set. A default channel to create the IBSS on if no cell was found is provided as well. However - a scan is initiated only on the default channel provided regardless of whether ifibss->fixed_channel is set or not, with the obvious result of the cell not joining existing IBSS cell that is on another channel. Fixes: `76bed0f43b` ("mac80211: IBSS fix scan request") Signed-off-by: Sara Sharon <sara.sharon@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-02-24 09:04:27 +01:00
Johannes Berg	f4a0f0c526	mac80211: add RX_FLAG_MACTIME_PLCP_START The timestamp given by iwlwifi is at the beginning of the frame over the air, at (or during) the SYNC field. Allow such timestamps to be given to mac80211, at least (for now) for frames with non-HT/VHT preambles. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-02-24 09:04:27 +01:00
Michal Kazior	cf44012810	mac80211: fix unnecessary frame drops in mesh fwding The ieee80211_queue_stopped() expects hw queue number but it was given raw WMM AC number instead. This could cause frame drops and problems with traffic in some cases - most notably if driver doesn't map AC numbers to queue numbers 1:1 and uses ieee80211_stop_queues() and ieee80211_wake_queue() only without ever calling ieee80211_wake_queues(). On ath10k it was possible to hit this problem in the following case: 1. wlan0 uses queue 0 (ath10k maps queues per vif) 2. offchannel uses queue 15 3. queues 1-14 are unused 4. ieee80211_stop_queues() 5. ieee80211_wake_queue(q=0) 6. ieee80211_wake_queue(q=15) (other queues are not woken up because both driver and mac80211 know other queues are unused) 7. ieee80211_rx_h_mesh_fwding() 8. ieee80211_select_queue_80211() returns 2 9. ieee80211_queue_stopped(q=2) returns true 10. frame is dropped (oops!) Fixes: `d3c1597b8d` ("mac80211: fix forwarded mesh frame queue mapping") Signed-off-by: Michal Kazior <michal.kazior@tieto.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-02-24 09:04:26 +01:00
Michal Kazior	2a58d42c1e	mac80211: fix txq queue related crashes The driver can access the queue simultanously while mac80211 tears down the interface. Without spinlock protection this could lead to corrupting sk_buff_head and subsequently to an invalid pointer dereference. Fixes: `ba8c3d6f16` ("mac80211: add an intermediate software queue implementation") Signed-off-by: Michal Kazior <michal.kazior@tieto.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-02-24 09:04:26 +01:00
Sunil Shahu	b8631c0033	mac80211: mesh_plink: remove redundant sta_info check Remove unnecessory "if" statement and club it with previos "if" block. Signed-off-by: Sunil Shahu <shshahu@gmail.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-02-24 09:04:25 +01:00
João Paulo Rechi Vita	e2a35e8929	rfkill: Remove obsolete "claim" sysfs interface This was scheduled to be removed in 2012 by: commit `69c86373c6` Author: florian@mickler.org <florian@mickler.org> Date: Wed Feb 24 12:05:16 2010 +0100 Document the rfkill sysfs ABI This moves sysfs ABI info from Documentation/rfkill.txt to the ABI subfolder and reformats it. This also schedules the deprecated sysfs parts to be removed in 2012 (claim file) and 2014 (state file). Signed-off-by: Florian Mickler <florian@mickler.org> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: João Paulo Rechi Vita <jprvita@endlessm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-02-24 09:04:24 +01:00
João Paulo Rechi Vita	1926e260d8	rfkill: remove/inline __rfkill_set_hw_state __rfkill_set_hw_state() is only one used in rfkill_set_hw_state(), and none of them are long or complicated, so merging the two makes the code easier to read. Signed-off-by: João Paulo Rechi Vita <jprvita@endlessm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-02-24 09:04:24 +01:00
João Paulo Rechi Vita	f3e7fae248	rfkill: use variable instead of duplicating the expression RFKILL_BLOCK_SW value have just been saved to prev, no need to check it again in the if expression. This makes code a little bit easier to read. Signed-off-by: João Paulo Rechi Vita <jprvita@endlessm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-02-24 09:04:23 +01:00
Ola Olsson	573a2b51ac	cfg80211: Fix some linguistics in Kconfig Signed-off-by: Ola Olsson <ola.olsson@sonymobile.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-02-24 09:04:23 +01:00
Johannes Berg	dd21dfc645	rfkill: disentangle polling pause and suspend When suspended while polling is paused, polling will erroneously resume at resume time. Fix this by tracking pause and suspend in separate state variable and adding the necessary checks. Clarify the documentation on this as well. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-02-24 09:04:22 +01:00
Johannes Berg	8ac3c70419	mac80211: refactor HT/VHT to chandef code The station MLME and IBSS/mesh ones use entirely different code for interpreting HT and VHT operation elements. Change the code that interprets them a bit - it now modifies an existing chandef - and use it also in the MLME code. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-02-24 09:04:21 +01:00
Ola Olsson	de3bb771f4	cfg80211: add more warnings for inconsistent ops Print a warning whenever an expected callback function lacks implementation. Signed-off-by: Ola Olsson <ola.olsson@sonymobile.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-02-24 09:04:20 +01:00
Emmanuel Grumbach	506bcfa8ab	mac80211: limit the A-MSDU Tx based on peer's capabilities In VHT, the specification allows to limit the number of MSDUs in an A-MSDU in the Extended Capabilities IE. There is also a limitation on the byte size in the VHT IE. In HT, the only limitation is on the byte size. Parse the capabilities from the peer and make them available to the driver. In HT, there is another limitation when a BA agreement is active: the byte size can't be greater than 4095. This is not enforced here. Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-02-24 09:04:20 +01:00
Ilan Peer	a7201a6c5e	mac80211: Recalc min chandef when station is associated The minimum chandef bandwidth calculation was done only in case a new station was inserted (or when an existing station was removed). However, it is possible that stations are inserted before they are associated, e.g., when FULL_AP_CLIENT_STATE is supported and user space adds stations unassociated. Fix this by calling ieee80211_recalc_min_chandef() whenever a station transitions in/out the associated state, and only consider station marked as associated. Signed-off-by: Ilan Peer <ilan.peer@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-02-24 09:04:19 +01:00
Grzegorz Bajorski	178830481e	mac80211: allow drivers to report (non-)monitor frames Some drivers offload some frames internally (e.g. AddBa). Reporting such frames to mac80211 would only confuse MLME. However it would be useful to be able to pass such frames to monitor interfaces for sniffing purposes, e.g. when running AP + monitor. To do that allow drivers to tell mac80211 whether a given frame should be: - processed but not delivered to any monitor vif - not processed but delievered to monitor vifs only Signed-off-by: Grzegorz Bajorski <grzegorz.bajorski@tieto.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-02-24 09:04:19 +01:00
Sara Sharon	412a6d800c	mac80211: support hw managing reorder logic Enable driver to manage the reordering logic itself. This is needed for example for the iwlwifi driver that will support hardware assisted reordering. Signed-off-by: Sara Sharon <sara.sharon@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-02-24 09:04:16 +01:00
Linus Torvalds	420eb6d7ef	NFS client bugfixes for Linux 4.5 Stable bugfixes: - Fix nfs_size_to_loff_t - NFSv4: Fix a dentry leak on alias use Other bugfixes: - Don't schedule a layoutreturn if the layout segment can be freed immediately. - Always set NFS_LAYOUT_RETURN_REQUESTED with lo->plh_return_iomode - rpcrdma_bc_receive_call() should init rq_private_buf.len - fix stateid handling for the NFS v4.2 operations - pnfs/blocklayout: fix a memeory leak when using,vmalloc_to_page - fix panic in gss_pipe_downcall() in fips mode - Fix a race between layoutget and pnfs_destroy_layout - Fix a race between layoutget and bulk recalls -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJWzKShAAoJEGcL54qWCgDyN0QQALiX8v2wvn07vE5ZeXB5uONq +mfx8avhEoc3NVrpG6F4Kj+yJmHeAbkgIygnhZn4tcM/2YRxGDwlVLHb++yUTHO9 8zEi+tiKx9f5pK2PxRQ0PjavVxO/xOyO0/QNrUdnj8hSNR9ow+YOVjEYUulbuhIg VAI3oSy5qIKgtDyW7w5PuPpTXLo74hPmyqHaa+ZIr2et//nJMSsw++vAmSg3oqXq 6QkLWPHt/8yvDRRn2hKkbD9gOrFCVfaZIGLM6Q0zRWAcGTzJi94ELzPdm8cVpD1o eXKcufgLXPt3GOeAmxZ9kwQeebR6IFcvkYom5dsPhtMBuzXu1wpanU8PGgYIQ0VA 88b2YNl+TZpiVbRzxSEellZq5b+zapH/VVVnYptZiq9wUTACc7jK6W2heqe5PzaT iepTGCAE21tV5JewcITMQHDZiOjRNdtbBzgixI7pNfMN8whU6e5NHYj6psZqT7cf xEEZzL+RBJuCFKhXSPbBefccA4HCRkDEpT+2QgrMbS4KKfWOg36UNbJ2kgbvcRVi HTqoRONR6zMzYBhyMlLaUuJ1co8nSHgEsL81Q3MwWSY6gucSW7jeJ2stR20KJIo1 7qgod9Ac/BAIozjzywi0LtmxouPyPU8cqaboMhSRVPDKfFlqZBNBkFLNWwgoYXMa r1afZQwNeRRbZUR3RulE =/WDS -----END PGP SIGNATURE----- Merge tag 'nfs-for-4.5-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs Pull NFS client bugfixes from Trond Myklebust: "Stable bugfixes: - Fix nfs_size_to_loff_t - NFSv4: Fix a dentry leak on alias use Other bugfixes: - Don't schedule a layoutreturn if the layout segment can be freed immediately. - Always set NFS_LAYOUT_RETURN_REQUESTED with lo->plh_return_iomode - rpcrdma_bc_receive_call() should init rq_private_buf.len - fix stateid handling for the NFS v4.2 operations - pnfs/blocklayout: fix a memeory leak when using,vmalloc_to_page - fix panic in gss_pipe_downcall() in fips mode - Fix a race between layoutget and pnfs_destroy_layout - Fix a race between layoutget and bulk recalls" * tag 'nfs-for-4.5-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFSv4.x/pnfs: Fix a race between layoutget and bulk recalls NFSv4.x/pnfs: Fix a race between layoutget and pnfs_destroy_layout auth_gss: fix panic in gss_pipe_downcall() in fips mode pnfs/blocklayout: fix a memeory leak when using,vmalloc_to_page nfs4: fix stateid handling for the NFS v4.2 operations NFSv4: Fix a dentry leak on alias use xprtrdma: rpcrdma_bc_receive_call() should init rq_private_buf.len pNFS: Always set NFS_LAYOUT_RETURN_REQUESTED with lo->plh_return_iomode pNFS: Fix pnfs_mark_matching_lsegs_return() nfs: fix nfs_size_to_loff_t	2016-02-23 16:39:21 -08:00
Bernie Harris	5146d1f151	tunnel: Clear IPCB(skb)->opt before dst_link_failure called IPCB may contain data from previous layers (in the observed case the qdisc layer). In the observed scenario, the data was misinterpreted as ip header options, which later caused the ihl to be set to an invalid value (<5). This resulted in an infinite loop in the mips implementation of ip_fast_csum. This patch clears IPCB(skb)->opt before dst_link_failure can be called for various types of tunnels. This change only applies to encapsulated ipv4 packets. The code introduced in `11c21a30` which clears all of IPCB has been removed to be consistent with these changes, and instead the opt field is cleared unconditionally in ip_tunnel_xmit. The change in ip_tunnel_xmit applies to SIT, GRE, and IPIP tunnels. The relevant vti, l2tp, and pptp functions already contain similar code for clearing the IPCB. Signed-off-by: Bernie Harris <bernie.harris@alliedtelesis.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-23 19:11:56 -05:00
Konstantin Khlebnikov	9bdfb3b79e	tcp: convert cached rtt from usec to jiffies when feeding initial rto Currently it's converted into msecs, thus HZ=1000 intact. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Fixes: `740b0f1841` ("tcp: switch rtt estimations to usec resolution") Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-23 18:28:46 -05:00
Vivien Didelot	9d2dd73669	net: dsa: remove dsa_bridge_check_vlan_range DSA drivers may support multiple bridge groups with the same hardware VLAN. The mv88e6xxx driver which cannot yet, already has its own check for overlapping bridges. Thus remove the check from the DSA layer. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-23 14:52:46 -05:00
Vivien Didelot	a6692754d6	net: dsa: pass bridge down to drivers Some DSA drivers may or may not support multiple software bridges on top of an hardware switch. It is more convenient for them to access the bridge's net_device for finer configuration. Removing the need to craft and access a bitmask also simplifies the code. This patch changes the signature of bridge related functions, update DSA drivers, and removes dsa_slave_br_port_mask. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Tested-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-23 14:52:46 -05:00
Alexander Aring	ebba380cc9	ieee802154: 6lowpan: fix return of netdev notifier This patch fixed the return value of netdev notifier. If the command is a don't care a NOTIFY_DONE should be returned. If the command matched a NOTIFY_OK should be returned. Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-02-23 20:29:40 +01:00
Alexander Aring	5609c185f2	6lowpan: iphc: add support for stateful compression This patch introduce support for IPHC stateful address compression. It will offer the context table via one debugfs entry. This debugfs has and directory for each cid entry for the context table. Inside each cid directory there exists the following files: - "active": If the entry is added or deleted. The context table is original a list implementation, this flag will indicate if the context is part of list or not. - "prefix": The ipv6 prefix. - "prefix_length": The prefix length for the prefix. - "compression": The compression flag according RFC6775. This part should be moved into sysfs after some testing time. Also the debugfs entry contains a "show" file which is a pretty-printout for the current context table information. Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-02-23 20:29:40 +01:00
Koen Zandberg	aef00c15b8	mac802154: Fixes kernel oops when unloading a radio driver Destroying the workqueue before unregistering the net device caused a kernel oops Signed-off-by: Koen Zandberg <koen@bergzand.net> Acked-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-02-23 20:29:40 +01:00
Wei-Ning Huang	d82142a8b1	Bluetooth: hci_core: cancel power off delayed work properly When the HCI_AUTO_OFF flag is cleared, the power_off delayed work need to be cancel or HCI will be powered off even if it's managed. Signed-off-by: Wei-Ning Huang <wnhuang@chromium.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-02-23 20:29:38 +01:00
Heiner Kallweit	b6e402fc84	Bluetooth: Use managed version of led_trigger_register in LED trigger Recently a managed version of led_trigger_register was introduced. Using devm_led_trigger_register allows to simplify the LED trigger code. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-02-23 20:29:36 +01:00
Heiner Kallweit	6d5d2ee63c	Bluetooth: add LED trigger for indicating HCI is powered up Add support for LED triggers to the Bluetooth subsystem and add kernel config symbol BT_LEDS for it. For now one trigger for indicating "HCI is powered up" is supported. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-02-23 20:29:35 +01:00
Stefan Hajnoczi	b7052cd7bc	sunrpc/cache: fix off-by-one in qword_get() The qword_get() function NUL-terminates its output buffer. If the input string is in hex format \xXXXX... and the same length as the output buffer, there is an off-by-one: int qword_get(char *bpp, char dest, int bufsize) { ... while (len < bufsize) { ... dest++ = (h << 4) \| l; len++; } ... dest = '\0'; return len; } This patch ensures the NUL terminator doesn't fall outside the output buffer. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-02-23 13:20:16 -05:00
Arend van Spriel	b86071528f	cfg80211: stop critical protocol session upon disconnect event When user-space has started a critical protocol session and a disconnect event occurs, the rdev::crit_prot_nlportid remains set. This caused a subsequent NL80211_CMD_CRIT_PROTO_START to fail (-EBUSY). Fix this by clearing the rdev attribute and call .crit_proto_stop() callback upon disconnect event. Reviewed-by: Hante Meuleman <meuleman@broadcom.com> Reviewed-by: Pieter-Paul Giesberts <pieterpg@broadcom.com> Signed-off-by: Arend van Spriel <arend@broadcom.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-02-23 10:41:24 +01:00
Ola Olsson	5e950a78bf	nl80211: Zero out the connection keys memory when freeing them. The connection keys are zeroed out in all other cases except this one. Let's fix the last one as well. Signed-off-by: Ola Olsson <ola.olsson@sonymobile.com> Reviewed-by: Julian Calaby <julian.calaby@gmail.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-02-23 10:40:36 +01:00
Felix Fietkau	7a36b930e6	mac80211: minstrel_ht: set default tx aggregation timeout to 0 The value 5000 was put here with the addition of the timeout field to ieee80211_start_tx_ba_session. It was originally added in mac80211 to save resources for drivers like iwlwifi, which only supports a limited number of concurrent aggregation sessions. Since iwlwifi does not use minstrel_ht and other drivers don't need this, 0 is a better default - especially since there have been recent reports of aggregation setup related issues reproduced with ath9k. This should improve stability without causing any adverse effects. Cc: stable@vger.kernel.org Acked-by: Avery Pennarun <apenwarr@gmail.com> Signed-off-by: Felix Fietkau <nbd@openwrt.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-02-23 10:39:24 +01:00
Sven Eckelmann	7e2366c626	batman-adv: Rename batadv_tt_orig_list_entry _free_ref function to _put The batman-adv source code is the only place in the kernel which uses the _free_ref naming scheme for the _put functions. Changing it to _put makes it more consistent and makes it easier to understand the connection to the _get functions. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-23 13:51:01 +08:00
Sven Eckelmann	5dafd8a6cc	batman-adv: Rename batadv_tt_global_entry _free_ref function to _put The batman-adv source code is the only place in the kernel which uses the _free_ref naming scheme for the _put functions. Changing it to _put makes it more consistent and makes it easier to understand the connection to the _get functions. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-23 13:51:01 +08:00
Sven Eckelmann	95c0db90c7	batman-adv: Rename batadv_tt_local_entry _free_ref function to _put The batman-adv source code is the only place in the kernel which uses the _free_ref naming scheme for the _put functions. Changing it to _put makes it more consistent and makes it easier to understand the connection to the _get functions. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-23 13:51:00 +08:00
Sven Eckelmann	21754e2501	batman-adv: Rename batadv_orig_node_vlan _free_ref function to _put The batman-adv source code is the only place in the kernel which uses the _free_ref naming scheme for the _put functions. Changing it to _put makes it more consistent and makes it easier to understand the connection to the _get functions. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-23 13:50:58 +08:00
Sven Eckelmann	5fff28255f	batman-adv: Rename batadv_nc_path _free_ref function to _put The batman-adv source code is the only place in the kernel which uses the _free_ref naming scheme for the _put functions. Changing it to _put makes it more consistent and makes it easier to understand the connection to the _get functions. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-23 13:49:11 +08:00
Sven Eckelmann	27ad7545bb	batman-adv: Rename batadv_nc_node _free_ref function to _put The batman-adv source code is the only place in the kernel which uses the _free_ref naming scheme for the _put functions. Changing it to _put makes it more consistent and makes it easier to understand the connection to the _get functions. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-23 13:49:11 +08:00
Sven Eckelmann	9c3bf08189	batman-adv: Rename batadv_softif_vlan _free_ref function to _put The batman-adv source code is the only place in the kernel which uses the _free_ref naming scheme for the _put functions. Changing it to _put makes it more consistent and makes it easier to understand the connection to the _get functions. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-23 13:49:10 +08:00
Sven Eckelmann	4a13147cb5	batman-adv: Rename batadv_tvlv_container _free_ref function to _put The batman-adv source code is the only place in the kernel which uses the _free_ref naming scheme for the _put functions. Changing it to _put makes it more consistent and makes it easier to understand the connection to the _get functions. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-23 13:49:10 +08:00
Sven Eckelmann	ba610043af	batman-adv: Rename batadv_tvlv_handler _free_ref function to _put The batman-adv source code is the only place in the kernel which uses the _free_ref naming scheme for the _put functions. Changing it to _put makes it more consistent and makes it easier to understand the connection to the _get functions. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-23 13:49:09 +08:00
Sven Eckelmann	3a01743ddf	batman-adv: Rename batadv_gw_node _free_ref function to _put The batman-adv source code is the only place in the kernel which uses the _free_ref naming scheme for the _put functions. Changing it to _put makes it more consistent and makes it easier to understand the connection to the _get functions. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-23 13:49:06 +08:00
Sven Eckelmann	a6416f9ffc	batman-adv: Rename batadv_dat_entry _free_ref function to _put The batman-adv source code is the only place in the kernel which uses the _free_ref naming scheme for the _put functions. Changing it to _put makes it more consistent and makes it easier to understand the connection to the _get functions. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-23 13:48:28 +08:00
Sven Eckelmann	321e3e0884	batman-adv: Rename batadv_claim _free_ref function to _put The batman-adv source code is the only place in the kernel which uses the _free_ref naming scheme for the _put functions. Changing it to _put makes it more consistent and makes it easier to understand the connection to the _get functions. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-23 13:48:27 +08:00
Sven Eckelmann	c8b86c1241	batman-adv: Rename batadv_backbone_gw _free_ref function to _put The batman-adv source code is the only place in the kernel which uses the _free_ref naming scheme for the _put functions. Changing it to _put makes it more consistent and makes it easier to understand the connection to the _get functions. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-23 13:48:27 +08:00
Sven Eckelmann	accadc35a1	batman-adv: Rename batadv_hardif_neigh _free_ref function to _put The batman-adv source code is the only place in the kernel which uses the _free_ref naming scheme for the _put functions. Changing it to _put makes it more consistent and makes it easier to understand the connection to the _get functions. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-23 13:48:26 +08:00
Sven Eckelmann	35f94779c9	batman-adv: Rename batadv_orig_ifinfo _free_ref function to _put The batman-adv source code is the only place in the kernel which uses the _free_ref naming scheme for the _put functions. Changing it to _put makes it more consistent and makes it easier to understand the connection to the _get functions. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-23 13:48:26 +08:00
Sven Eckelmann	044fa3ae12	batman-adv: Rename batadv_neigh_ifinfo _free_ref function to _put The batman-adv source code is the only place in the kernel which uses the _free_ref naming scheme for the _put functions. Changing it to _put makes it more consistent and makes it easier to understand the connection to the _get functions. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-23 13:48:25 +08:00
Sven Eckelmann	25bb250996	batman-adv: Rename batadv_neigh_node _free_ref function to _put The batman-adv source code is the only place in the kernel which uses the _free_ref naming scheme for the _put functions. Changing it to _put makes it more consistent and makes it easier to understand the connection to the _get functions. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-23 13:48:25 +08:00
Sven Eckelmann	82047ad7fe	batman-adv: Rename batadv_hardif _free_ref function to _put The batman-adv source code is the only place in the kernel which uses the _free_ref naming scheme for the _put functions. Changing it to _put makes it more consistent and makes it easier to understand the connection to the _get functions. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-23 13:48:24 +08:00
Sven Eckelmann	5d9673109c	batman-adv: Rename batadv_orig_node _free_ref function to _put The batman-adv source code is the only place in the kernel which uses the _free_ref naming scheme for the _put functions. Changing it to _put makes it more consistent and makes it easier to understand the connection to the _get functions. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-23 13:48:24 +08:00
Antonio Quartulli	e9ccd7e39c	batman-adv: remove unused BATADV_BONDING_TQ_THRESHOLD constant BATADV_BONDING_TQ_THRESHOLD is not used anymore since the implementation of the bat_neigh_is_similar_or_better() API function. Such function uses the more generic BATADV_TQ_SIMILARITY_THRESHOLD constant. Therefore, remove definition of the unused BATADV_BONDING_TQ_THRESHOLD constant. Signed-off-by: Antonio Quartulli <a@unstable.cc> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2016-02-23 13:48:23 +08:00
David S. Miller	b633353115	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/phy/bcm7xxx.c drivers/net/phy/marvell.c drivers/net/vxlan.c All three conflicts were cases of simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-23 00:09:14 -05:00
David S. Miller	9ca69b7054	Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Johan Hedberg says: ==================== pull request: bluetooth 2016-02-20 Here's an important patch for 4.5 which fixes potential invalid pointer access when processing completed Bluetooth HCI commands. Please let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-21 22:46:26 -05:00
Daniel Borkmann	6205b9cf20	bpf: don't emit mov A,A on return While debugging with bpf_jit_disasm I noticed emissions of 'mov %eax,%eax', and found that this comes from BPF_RET \| BPF_A translations from classic BPF. Emitting this is unnecessary as BPF_REG_A is mapped into BPF_REG_0 already, therefore only emit a mov when immediates are used as return value. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-21 22:07:11 -05:00
Daniel Borkmann	2f72959a9c	bpf: fix csum update in bpf_l4_csum_replace helper for udp When using this helper for updating UDP checksums, we need to extend this in order to write CSUM_MANGLED_0 for csum computations that result into 0 as sum. Reason we need this is because packets with a checksum could otherwise become incorrectly marked as a packet without a checksum. Likewise, if the user indicates BPF_F_MARK_MANGLED_0, then we should not turn packets without a checksum into ones with a checksum. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-21 22:07:10 -05:00
Daniel Borkmann	3697649ff2	bpf: try harder on clones when writing into skb When we're dealing with clones and the area is not writeable, try harder and get a copy via pskb_expand_head(). Replace also other occurences in tc actions with the new skb_try_make_writable(). Reported-by: Ashhad Sheikh <ashhadsheikh394@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-21 22:07:10 -05:00
Daniel Borkmann	21cafc1dc2	bpf: remove artificial bpf_skb_{load, store}_bytes buffer limitation We currently limit bpf_skb_store_bytes() and bpf_skb_load_bytes() helpers to only store or load a maximum buffer of 16 bytes. Thus, loading, rewriting and storing headers require several bpf_skb_load_bytes() and bpf_skb_store_bytes() calls. Also here we can use a per-cpu scratch buffer instead in order to not pressure stack space any further. I do suspect that this limit was mainly set in place for this particular reason. So, ease program development by removing this limitation and make the scratchpad generic, so it can be reused. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-21 22:07:10 -05:00
Daniel Borkmann	7d672345ed	bpf: add generic bpf_csum_diff helper For L4 checksums, we currently have bpf_l4_csum_replace() helper. It's currently limited to handle 2 and 4 byte changes in a header and feeds the from/to into inet_proto_csum_replace{2,4}() helpers of the kernel. When working with IPv6, for example, this makes it rather cumbersome to deal with, similarly when editing larger parts of a header. Instead, extend the API in a more generic way: For bpf_l4_csum_replace(), add a case for header field mask of 0 to change the checksum at a given offset through inet_proto_csum_replace_by_diff(), and provide a helper bpf_csum_diff() that can generically calculate a from/to diff for arbitrary amounts of data. This can be used in multiple ways: for the bpf_l4_csum_replace() only part, this even provides us with the option to insert precalculated diffs from user space f.e. from a map, or from bpf_csum_diff() during runtime. bpf_csum_diff() has a optional from/to stack buffer input, so we can calculate a diff by using a scratchbuffer for scenarios where we're inserting (from is NULL), removing (to is NULL) or diffing (from/to buffers don't need to be of equal size) data. Also, bpf_csum_diff() allows to feed a previous csum into csum_partial(), so the function can also be cascaded. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-21 22:07:09 -05:00
Robert Shearman	84a8cbe46a	ila: autoload module Avoid users having to manually load the module by adding a module alias allowing it to be autoloaded by the lwt infra. Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-21 22:00:28 -05:00
Robert Shearman	b2b04edceb	mpls: autoload lwt module Avoid users having to manually load the module by adding a module alias allowing it to be autoloaded by the lwt infra. Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-21 22:00:28 -05:00
Robert Shearman	745041e2aa	lwtunnel: autoload of lwt modules The lwt implementations using net devices can autoload using the existing mechanism using IFLA_INFO_KIND. However, there's no mechanism that lwt modules not using net devices can use. Therefore, add the ability to autoload modules registering lwt operations for lwt implementations not using a net device so that users don't have to manually load the modules. Only users with the CAP_NET_ADMIN capability can cause modules to be loaded, which is ensured by rtnetlink_rcv_msg rejecting non-RTM_GETxxx messages for users without this capability, and by lwtunnel_build_state not being called in response to RTM_GETxxx messages. Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-21 22:00:28 -05:00
Zhang Shengju	e817af27e0	vlan: turn on unicast filtering on vlan device Currently vlan device inherits unicast filtering flag from underlying device. If underlying device doesn't support unicast filter, this will put vlan device into promiscuous mode when it's stacked. Tun on IFF_UNICAST_FLT on the vlan device in any case so that it does not go into promiscuous mode needlessly. If underlying device does not support unicast filtering, that device will enter promiscuous mode. Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-21 21:54:05 -05:00
Neil Horman	d9749fb594	sctp: Fix port hash table size computation Dmitry Vyukov noted recently that the sctp_port_hashtable had an error in its size computation, observing that the current method never guaranteed that the hashsize (measured in number of entries) would be a power of two, which the input hash function for that table requires. The root cause of the problem is that two values need to be computed (one, the allocation order of the storage requries, as passed to __get_free_pages, and two the number of entries for the hash table). Both need to be ^2, but for different reasons, and the existing code is simply computing one order value, and using it as the basis for both, which is wrong (i.e. it assumes that ((1<<order)*PAGE_SIZE)/sizeof(bucket) is still ^2 when its not). To fix this, we change the logic slightly. We start by computing a goal allocation order (which is limited by the maximum size hash table we want to support. Then we attempt to allocate that size table, decreasing the order until a successful allocation is made. Then, with the resultant successful order we compute the number of buckets that hash table supports, which we then round down to the nearest power of two, giving us the number of entries the table actually supports. I've tested this locally here, using non-debug and spinlock-debug kernels, and the number of entries in the hashtable consistently work out to be powers of two in all cases. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> CC: Dmitry Vyukov <dvyukov@google.com> CC: Vladislav Yasevich <vyasevich@gmail.com> CC: "David S. Miller" <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-21 21:52:51 -05:00
Douglas Anderson	3bd7594e69	Bluetooth: hci_core: Avoid mixing up req_complete and req_complete_skb In commit `44d2713774` ("Bluetooth: Compress the size of struct hci_ctrl") we squashed down the size of the structure by using a union with the assumption that all users would use the flag to determine whether we had a req_complete or a req_complete_skb. Unfortunately we had a case in hci_req_cmd_complete() where we weren't looking at the flag. This can result in a situation where we might be storing a hci_req_complete_skb_t in a hci_req_complete_t variable, or vice versa. During some testing I found at least one case where the function hci_req_sync_complete() was called improperly because the kernel thought that it didn't require an SKB. Looking through the stack in kgdb I found that it was called by hci_event_packet() and that hci_event_packet() had both of its locals "req_complete" and "req_complete_skb" pointing to the same place: both to hci_req_sync_complete(). Let's make sure we always check the flag. For more details on debugging done, see <http://crbug.com/588288>. Fixes: `44d2713774` ("Bluetooth: Compress the size of struct hci_ctrl") Signed-off-by: Douglas Anderson <dianders@chromium.org> Acked-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-02-20 08:52:28 +01:00
Rainer Weikusat	18eceb818d	af_unix: Don't use continue to re-execute unix_stream_read_generic loop The unix_stream_read_generic function tries to use a continue statement to restart the receive loop after waiting for a message. This may not work as intended as the caller might use a recvmsg call to peek at control messages without specifying a message buffer. If this was the case, the continue will cause the function to return without an error and without the credential information if the function had to wait for a message while it had returned with the credentials otherwise. Change to using goto to restart the loop without checking the condition first in this case so that credentials are returned either way. Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-19 23:50:31 -05:00
Dmitry V. Levin	b5f0549231	unix_diag: fix incorrect sign extension in unix_lookup_by_ino The value passed by unix_diag_get_exact to unix_lookup_by_ino has type __u32, but unix_lookup_by_ino's argument ino has type int, which is not a problem yet. However, when ino is compared with sock_i_ino return value of type unsigned long, ino is sign extended to signed long, and this results to incorrect comparison on 64-bit architectures for inode numbers greater than INT_MAX. This bug was found by strace test suite. Fixes: `5d3cae8bc3` ("unix_diag: Dumping exact socket core") Signed-off-by: Dmitry V. Levin <ldv@altlinux.org> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-19 23:49:23 -05:00
Daniel Borkmann	6b83d28a55	net: use skb_postpush_rcsum instead of own implementations Replace individual implementations with the recently introduced skb_postpush_rcsum() helper. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Tom Herbert <tom@herbertland.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-19 23:43:10 -05:00
Kan Liang	f38d138a7d	net/ethtool: support set coalesce per queue This patch implements sub command ETHTOOL_SCOALESCE for ioctl ETHTOOL_PERQUEUE. It introduces an interface set_per_queue_coalesce to set coalesce of each masked queue to device driver. The wanted coalesce information are stored in "data" for each masked queue, which can copy from userspace. If it fails to set coalesce to device driver, the value which already set to specific queue will be tried to rollback. Signed-off-by: Kan Liang <kan.liang@intel.com> Reviewed-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-19 22:54:10 -05:00
Kan Liang	421797b1aa	net/ethtool: support get coalesce per queue This patch implements sub command ETHTOOL_GCOALESCE for ioctl ETHTOOL_PERQUEUE. It introduces an interface get_per_queue_coalesce to get coalesce of each masked queue from device driver. Then the interrupt coalescing parameters will be copied back to user space one by one. Signed-off-by: Kan Liang <kan.liang@intel.com> Reviewed-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-19 22:54:09 -05:00
Kan Liang	ac2c7ad0e5	net/ethtool: introduce a new ioctl for per queue setting Introduce a new ioctl ETHTOOL_PERQUEUE for per queue parameters setting. The following patches will enable some SUB_COMMANDs for per queue setting. Signed-off-by: Kan Liang <kan.liang@intel.com> Reviewed-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-19 22:54:09 -05:00
Wei Wang	e0d8c1b738	ipv6: pass up EMSGSIZE msg for UDP socket in Ipv6 In ipv4, when the machine receives a ICMP_FRAG_NEEDED message, the connected UDP socket will get EMSGSIZE message on its next read from the socket. However, this is not the case for ipv6. This fix modifies the udp err handler in Ipv6 for ICMP6_PKT_TOOBIG to make it similar to ipv4 behavior. That is when the machine gets an ICMP6_PKT_TOOBIG message, the connected UDP socket will get EMSGSIZE message on its next read from the socket. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-19 15:46:24 -05:00
Paolo Abeni	c868ee7063	lwt: fix rx checksum setting for lwt devices tunneling over ipv6 the commit `35e2d1152b` ("tunnels: Allow IPv6 UDP checksums to be correctly controlled.") changed the default xmit checksum setting for lwt vxlan/geneve ipv6 tunnels, so that now the checksum is not set into external UDP header. This commit changes the rx checksum setting for both lwt vxlan/geneve devices created by openvswitch accordingly, so that lwt over ipv6 tunnel pairs are again able to communicate with default values. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Jiri Benc <jbenc@redhat.com> Acked-by: Jesse Gross <jesse@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-19 15:39:30 -05:00
Insu Yun	b53ce3e7d4	tipc: unlock in error path tipc_bcast_unlock need to be unlocked in error path. Signed-off-by: Insu Yun <wuninsu@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-19 15:38:44 -05:00
David S. Miller	29d1441dfc	Two of the fixes included in this patchset prevent wrong memory access - it was triggered when removing an object from a list after it was already free'd due to bad reference counting. This misbehaviour existed for both the gw_node and the orig_node_vlan object and has been fixed by Sven Eckelmann. The last patch fixes our interface feasibility check and prevents it from looping indefinitely when two net_device objects reference each other via iflink index (i.e. veth pair), by Andrew Lunn -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJWwy9MAAoJENpFlCjNi1MRu+cQAMF7zYIlWggCWUEqaEIId4qi daGly3DNrXWH8FJta+NR399xXlTXNoyS8uZYt9YUZrXlwXDbDNBHJ8Jcmc5sslZr /06os5KUvNLlasup1GLBbynKUCTBSjVyo3DBwVCsiCL/iHdN1tFZXg3c2ET2U7Tk 4bFZO4h1JkpwCYeCjCJQSbPrjnAXYdxV01iarDJZgAoc290f92Ob4mTuTrQr1Vzi EH2fyrd3m/qmRb2YBj7jNdVeD2Q4XyMKY4vHHeBq9JR/kwINwUg38+gnvcwK+KeD VJilETdC686ZJKBSVvECQZqeXr4HaCINFNuepZLsg128IQ2ZjqPYC1yTq/qc1VF+ YGCunXhbbNgxeE1+QgXssgRAiUBx43m2osQegpHkyCzK8GPbv1gRIbTYaHgmqa4A Pn6WHdV2kcfGRFNOASST5+eWPe9ol8RvAbXT0yD2+6CNz6aijfbGITUTlB24X+f+ 6CsS3uCBsH8FI3hqjgULTqcyF14QbnLZ9AnkxYwt8B0Ge62LEfdwuXAh96mIhfyf cGi7DBnHAYapOPzz0jSBhW5myiPOioNLJfzokZrBq+RuRBYUIUvzGmz8QJQpy/GO EdnMOE/uWeZeSCsAGTE00pjqfHliK/vu4Wh0puGFWTQGCBbKWP5p4eNqRh4p+2GC Sg1f1EKDt2PRw9aZHNHm =CvI3 -----END PGP SIGNATURE----- Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-merge Antonio Quartulli says: ==================== Two of the fixes included in this patchset prevent wrong memory access - it was triggered when removing an object from a list after it was already free'd due to bad reference counting. This misbehaviour existed for both the gw_node and the orig_node_vlan object and has been fixed by Sven Eckelmann. The last patch fixes our interface feasibility check and prevents it from looping indefinitely when two net_device objects reference each other via iflink index (i.e. veth pair), by Andrew Lunn ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-19 15:35:29 -05:00
Anton Protopopov	a97eb33ff2	rtnl: RTM_GETNETCONF: fix wrong return value An error response from a RTM_GETNETCONF request can return the positive error value EINVAL in the struct nlmsgerr that can mislead userspace. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-19 15:33:46 -05:00
Nikolay Aleksandrov	cfdd28beb3	net: make netdev_for_each_lower_dev safe for device removal When I used netdev_for_each_lower_dev in commit `bad5316232` ("vrf: remove slave queue and private slave struct") I thought that it acts like netdev_for_each_lower_private and can be used to remove the current device from the list while walking, but unfortunately it acts more like netdev_for_each_lower_private_rcu and doesn't allow it. The difference is where the "iter" points to, right now it points to the current element and that makes it impossible to remove it. Change the logic to be similar to netdev_for_each_lower_private and make it point to the "next" element so we can safely delete the current one. VRF is the only such user right now, there's no change for the read-only users. Here's what can happen now: [98423.249858] general protection fault: 0000 [#1] SMP [98423.250175] Modules linked in: vrf bridge(O) stp llc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace sunrpc crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel jitterentropy_rng sha256_generic hmac drbg ppdev aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd evdev serio_raw pcspkr virtio_balloon parport_pc parport i2c_piix4 i2c_core virtio_console acpi_cpufreq button 9pnet_virtio 9p 9pnet fscache ipv6 autofs4 ext4 crc16 mbcache jbd2 sg virtio_blk virtio_net sr_mod cdrom e1000 ata_generic ehci_pci uhci_hcd ehci_hcd usbcore usb_common virtio_pci ata_piix libata floppy virtio_ring virtio scsi_mod [last unloaded: bridge] [98423.255040] CPU: 1 PID: 14173 Comm: ip Tainted: G O 4.5.0-rc2+ #81 [98423.255386] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014 [98423.255777] task: ffff8800547f5540 ti: ffff88003428c000 task.ti: ffff88003428c000 [98423.256123] RIP: 0010:[<ffffffff81514f3e>] [<ffffffff81514f3e>] netdev_lower_get_next+0x1e/0x30 [98423.256534] RSP: 0018:ffff88003428f940 EFLAGS: 00010207 [98423.256766] RAX: 0002000100000004 RBX: ffff880054ff9000 RCX: 0000000000000000 [98423.257039] RDX: ffff88003428f8b8 RSI: ffff88003428f950 RDI: ffff880054ff90c0 [98423.257287] RBP: ffff88003428f940 R08: 0000000000000000 R09: 0000000000000000 [98423.257537] R10: 0000000000000001 R11: 0000000000000000 R12: ffff88003428f9e0 [98423.257802] R13: ffff880054a5fd00 R14: ffff88003428f970 R15: 0000000000000001 [98423.258055] FS: 00007f3d76881700(0000) GS:ffff88005d000000(0000) knlGS:0000000000000000 [98423.258418] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [98423.258650] CR2: 00007ffe5951ffa8 CR3: 0000000052077000 CR4: 00000000000406e0 [98423.258902] Stack: [98423.259075] ffff88003428f960 ffffffffa0442636 0002000100000004 ffff880054ff9000 [98423.259647] ffff88003428f9b0 ffffffff81518205 ffff880054ff9000 ffff88003428f978 [98423.260208] ffff88003428f978 ffff88003428f9e0 ffff88003428f9e0 ffff880035b35f00 [98423.260739] Call Trace: [98423.260920] [<ffffffffa0442636>] vrf_dev_uninit+0x76/0xa0 [vrf] [98423.261156] [<ffffffff81518205>] rollback_registered_many+0x205/0x390 [98423.261401] [<ffffffff815183ec>] unregister_netdevice_many+0x1c/0x70 [98423.261641] [<ffffffff8153223c>] rtnl_delete_link+0x3c/0x50 [98423.271557] [<ffffffff815335bb>] rtnl_dellink+0xcb/0x1d0 [98423.271800] [<ffffffff811cd7da>] ? __inc_zone_state+0x4a/0x90 [98423.272049] [<ffffffff815337b4>] rtnetlink_rcv_msg+0x84/0x200 [98423.272279] [<ffffffff810cfe7d>] ? trace_hardirqs_on+0xd/0x10 [98423.272513] [<ffffffff8153370b>] ? rtnetlink_rcv+0x1b/0x40 [98423.272755] [<ffffffff81533730>] ? rtnetlink_rcv+0x40/0x40 [98423.272983] [<ffffffff8155d6e7>] netlink_rcv_skb+0x97/0xb0 [98423.273209] [<ffffffff8153371a>] rtnetlink_rcv+0x2a/0x40 [98423.273476] [<ffffffff8155ce8b>] netlink_unicast+0x11b/0x1a0 [98423.273710] [<ffffffff8155d2f1>] netlink_sendmsg+0x3e1/0x610 [98423.273947] [<ffffffff814fbc98>] sock_sendmsg+0x38/0x70 [98423.274175] [<ffffffff814fc253>] ___sys_sendmsg+0x2e3/0x2f0 [98423.274416] [<ffffffff810d841e>] ? do_raw_spin_unlock+0xbe/0x140 [98423.274658] [<ffffffff811e1bec>] ? handle_mm_fault+0x26c/0x2210 [98423.274894] [<ffffffff811e19cd>] ? handle_mm_fault+0x4d/0x2210 [98423.275130] [<ffffffff81269611>] ? __fget_light+0x91/0xb0 [98423.275365] [<ffffffff814fcd42>] __sys_sendmsg+0x42/0x80 [98423.275595] [<ffffffff814fcd92>] SyS_sendmsg+0x12/0x20 [98423.275827] [<ffffffff81611bb6>] entry_SYSCALL_64_fastpath+0x16/0x7a [98423.276073] Code: c3 31 c0 5d c3 0f 1f 84 00 00 00 00 00 66 66 66 66 90 48 8b 06 55 48 81 c7 c0 00 00 00 48 89 e5 48 8b 00 48 39 f8 74 09 48 89 06 <48> 8b 40 e8 5d c3 31 c0 5d c3 0f 1f 84 00 00 00 00 00 66 66 66 [98423.279639] RIP [<ffffffff81514f3e>] netdev_lower_get_next+0x1e/0x30 [98423.279920] RSP <ffff88003428f940> CC: David Ahern <dsa@cumulusnetworks.com> CC: David S. Miller <davem@davemloft.net> CC: Roopa Prabhu <roopa@cumulusnetworks.com> CC: Vlad Yasevich <vyasevic@redhat.com> Fixes: `bad5316232` ("vrf: remove slave queue and private slave struct") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: David Ahern <dsa@cumulusnetworks.com> Tested-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-19 15:29:26 -05:00
Nikolay Aleksandrov	2125715635	bridge: mdb: add support for more attributes and export timer Currently mdb entries are exported directly as a structure inside MDBA_MDB_ENTRY_INFO attribute, we can't really extend it without breaking user-space. In order to export new mdb fields, I've converted the MDBA_MDB_ENTRY_INFO into a nested attribute which starts like before with struct br_mdb_entry (without header, as it's casted directly in iproute2) and continues with MDBA_MDB_EATTR_ attributes. This way we keep compatibility with older users and can export new data. I've tested this with iproute2, both with and without support for the added attribute and it works fine. So basically we again have MDBA_MDB_ENTRY_INFO with struct br_mdb_entry inside but it may contain also some additional MDBA_MDB_EATTR_ attributes such as MDBA_MDB_EATTR_TIMER which can be parsed by user-space. So the new structure is: [MDBA_MDB] = { [MDBA_MDB_ENTRY] = { [MDBA_MDB_ENTRY_INFO] [MDBA_MDB_ENTRY_INFO] { <- Nested attribute struct br_mdb_entry <- nla_put_nohdr() [MDBA_MDB_ENTRY attributes] <- normal netlink attributes } } } Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-19 15:27:36 -05:00
Nikolay Aleksandrov	76cc173d48	bridge: mdb: reduce the indentation level in br_mdb_fill_info Switch the port check and skip if it's null, this allows us to reduce one indentation level. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-19 15:27:36 -05:00
Trond Myklebust	d07fbb8fdf	NFS: NFSoRDMA Client Bugfix This patch fixes a bug where NFS v4.1 callbacks were returning RPC_GARBAGE_ARGS to the server. Signed-off-by: Anna Schumaker <Anna@OcarinaProject.net> -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJWxO+oAAoJENfLVL+wpUDrTwMQAPb1meyxjiRAWbZJPF/GO0oE R62K7y6Y7fJ8MtOXWO+cyhZ54sRaGflTm233KS9L9ICY0acLU0fcArO43QLDH5p9 hozOJso2C7Ti2AVcgpLRNAK//DyBe8zzSU4qNGdLGSccKJcl97oUuI+18QpeWS8p 4dbscyx6pYxwTBUrvihLSUWcONge+uKYHHOriZ/r/sno+4wHNY6e11xeseMqmHQx RFZOvsZakIoXsBW1UEZZsSpY0GnOTTTWZlk0rOb5ZyaCU+/sZygxJSHOSri66v8h vPp7zx2HUGGpeMNRXlRgWkiJ2B+pShJ7JH7kfq5ZyMoy1uEeMEw3cdVoG+/OSBaT EjOWbuj2Npq30GJAdXYE9joQ+Pd2d3h8q2/qzucpQh89hoGcf5jwgYrdyUgzHOdA Lnsvhv5Wm8QiKO8LxNtnNA607dsDI5FLus+ijNl0bVFT12MYS/2ge679armCh5tA eTJCp0ARKzWsoVRCCkGiPjmGg9yCTP73uNm/EXr4HLKfzSIbrZLK1Aswe9NS1+/L tLVJsxbjRKfm9Ovvs7fl0lB7S4fbMKW78PO+vUuNfaRbgtha7SexvTp4g3am0Plh 2JnOgUqeG6My5Q9z843iG7vBD5U7jMId8dD37IlkXNMeiav7xBo1MQIokUDfpuXo pbzSYIpdnIiPkSvAU3L2 =7AXa -----END PGP SIGNATURE----- Merge tag 'nfs-rdma-4.5-1' of git://git.linux-nfs.org/projects/anna/nfs-rdma NFS: NFSoRDMA Client Bugfix This patch fixes a bug where NFS v4.1 callbacks were returning RPC_GARBAGE_ARGS to the server. Signed-off-by: Anna Schumaker <Anna@OcarinaProject.net>	2016-02-18 20:22:03 -05:00
Anton Protopopov	449f14f01f	net: caif: fix erroneous return value The cfrfml_receive() function might return positive value EPROTO Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-18 14:59:35 -05:00
Anton Protopopov	48bb230e87	appletalk: fix erroneous return value The atalk_sendmsg() function might return wrong value ENETUNREACH instead of -ENETUNREACH. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-18 14:59:34 -05:00
Phil Sutter	a813104d92	IFF_NO_QUEUE: Fix for drivers not calling ether_setup() My implementation around IFF_NO_QUEUE driver flag assumed that leaving tx_queue_len untouched (specifically: not setting it to zero) by drivers would make it possible to assign a regular qdisc to them without having to worry about setting tx_queue_len to a useful value. This was only partially true: I overlooked that some drivers don't call ether_setup() and therefore not initialize tx_queue_len to the default value of 1000. Consequently, removing the workarounds in place for that case in qdisc implementations which cared about it (namely, pfifo, bfifo, gred, htb, plug and sfb) leads to problems with these specific interface types and qdiscs. Luckily, there's already a sanitization point for drivers setting tx_queue_len to zero, which can be reused to assign the fallback value most qdisc implementations used, which is 1. Fixes: `348e3435cb` ("net: sched: drop all special handling of tx_queue_len == 0") Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-18 14:56:53 -05:00
Jiri Benc	d13b161c2c	gre: clear IFF_TX_SKB_SHARING ether_setup sets IFF_TX_SKB_SHARING but this is not supported by gre as it modifies the skb on xmit. Also, clean up whitespace in ipgre_tap_setup when we're already touching it. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-18 14:43:48 -05:00
Jiri Benc	7f290c9435	iptunnel: scrub packet in iptunnel_pull_header Part of skb_scrub_packet was open coded in iptunnel_pull_header. Let it call skb_scrub_packet directly instead. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-18 14:34:54 -05:00
Vivien Didelot	7c25b16dbb	net: bridge: log port STP state on change Remove the shared br_log_state function and print the info directly in br_set_state, where the net_bridge_port state is actually changed. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Acked-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-18 14:20:08 -05:00
Florian Westphal	c5b0db3263	nfnetlink: Revert "nfnetlink: add support for memory mapped netlink" reverts commit `3ab1f683bf` ("nfnetlink: add support for memory mapped netlink")' Like previous commits in the series, remove wrappers that are not needed after mmapped netlink removal. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-18 11:42:22 -05:00
Florian Westphal	905f0a739a	nfnetlink: remove nfnetlink_alloc_skb Following mmapped netlink removal this code can be simplified by removing the alloc wrapper. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-18 11:42:19 -05:00
Florian Westphal	263ea09084	Revert "genl: Add genlmsg_new_unicast() for unicast message allocation" This reverts commit `bb9b18fb55` ("genl: Add genlmsg_new_unicast() for unicast message allocation")'. Nothing wrong with it; its no longer needed since this was only for mmapped netlink support. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-18 11:42:19 -05:00
Florian Westphal	551ddc057e	openvswitch: Revert: "Enable memory mapped Netlink i/o" revert commit `795449d8b8` ("openvswitch: Enable memory mapped Netlink i/o"). Following the mmaped netlink removal this code can be removed. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-18 11:42:18 -05:00
Florian Westphal	d1b4c689d4	netlink: remove mmapped netlink support mmapped netlink has a number of unresolved issues: - TX zerocopy support had to be disabled more than a year ago via commit `4682a03586` ("netlink: Always copy on mmap TX.") because the content of the mmapped area can change after netlink attribute validation but before message processing. - RX support was implemented mainly to speed up nfqueue dumping packet payload to userspace. However, since commit `ae08ce0021` ("netfilter: nfnetlink_queue: zero copy support") we avoid one copy with the socket-based interface too (via the skb_zerocopy helper). The other problem is that skbs attached to mmaped netlink socket behave different from normal skbs: - they don't have a shinfo area, so all functions that use skb_shinfo() (e.g. skb_clone) cannot be used. - reserving headroom prevents userspace from seeing the content as it expects message to start at skb->head. See for instance commit `aa3a022094` ("netlink: not trim skb for mmaped socket when dump"). - skbs handed e.g. to netlink_ack must have non-NULL skb->sk, else we crash because it needs the sk to check if a tx ring is attached. Also not obvious, leads to non-intuitive bug fixes such as `7c7bdf359` ("netfilter: nfnetlink: use original skbuff when acking batches"). mmaped netlink also didn't play nicely with the skb_zerocopy helper used by nfqueue and openvswitch. Daniel Borkmann fixed this via commit `6bb0fef489` ("netlink, mmap: fix edge-case leakages in nf queue zero-copy")' but at the cost of also needing to provide remaining length to the allocation function. nfqueue also has problems when used with mmaped rx netlink: - mmaped netlink doesn't allow use of nfqueue batch verdict messages. Problem is that in the mmap case, the allocation time also determines the ordering in which the frame will be seen by userspace (A allocating before B means that A is located in earlier ring slot, but this also means that B might get a lower sequence number then A since seqno is decided later. To fix this we would need to extend the spinlocked region to also cover the allocation and message setup which isn't desirable. - nfqueue can now be configured to queue large (GSO) skbs to userspace. Queing GSO packets is faster than having to force a software segmentation in the kernel, so this is a desirable option. However, with a mmap based ring one has to use 64kb per ring slot element, else mmap has to fall back to the socket path (NL_MMAP_STATUS_COPY) for all large packets. To use the mmap interface, userspace not only has to probe for mmap netlink support, it also has to implement a recv/socket receive path in order to handle messages that exceed the size of an rx ring element. Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Ken-ichirou MATSUZAWA <chamaken@gmail.com> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Thomas Graf <tgraf@suug.ch> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-18 11:42:18 -05:00
Eric Dumazet	7716682cc5	tcp/dccp: fix another race at listener dismantle Ilya reported following lockdep splat: kernel: ========================= kernel: [ BUG: held lock freed! ] kernel: 4.5.0-rc1-ceph-00026-g5e0a311 #1 Not tainted kernel: ------------------------- kernel: swapper/5/0 is freeing memory ffff880035c9d200-ffff880035c9dbff, with a lock still held there! kernel: (&(&queue->rskq_lock)->rlock){+.-...}, at: [<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0 kernel: 4 locks held by swapper/5/0: kernel: #0: (rcu_read_lock){......}, at: [<ffffffff8169ef6b>] netif_receive_skb_internal+0x4b/0x1f0 kernel: #1: (rcu_read_lock){......}, at: [<ffffffff816e977f>] ip_local_deliver_finish+0x3f/0x380 kernel: #2: (slock-AF_INET){+.-...}, at: [<ffffffff81685ffb>] sk_clone_lock+0x19b/0x440 kernel: #3: (&(&queue->rskq_lock)->rlock){+.-...}, at: [<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0 To properly fix this issue, inet_csk_reqsk_queue_add() needs to return to its callers if the child as been queued into accept queue. We also need to make sure listener is still there before calling sk->sk_data_ready(), by holding a reference on it, since the reference carried by the child can disappear as soon as the child is put on accept queue. Reported-by: Ilya Dryomov <idryomov@gmail.com> Fixes: `ebb516af60` ("tcp/dccp: fix race at listener dismantle phase") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-18 11:35:51 -05:00
Xin Long	deed49df73	route: check and remove route cache when we get route Since the gc of ipv4 route was removed, the route cached would has no chance to be removed, and even it has been timeout, it still could be used, cause no code to check it's expires. Fix this issue by checking and removing route cache when we get route. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-18 11:31:36 -05:00
Jamal Hadi Salim	7e6e18fbc0	net_sched: Improve readability of filter processing Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-18 11:19:12 -05:00
Ido Schimmel	7fbac984f3	bridge: switchdev: Offload VLAN flags to hardware bridge When VLANs are created / destroyed on a VLAN filtering bridge (MASTER flag set), the configuration is passed down to the hardware. However, when only the flags (e.g. PVID) are toggled, the configuration is done in the software bridge alone. While it is possible to pass these flags to hardware when invoked with the SELF flag set, this creates inconsistency with regards to the way the VLANs are initially configured. Pass the flags down to the hardware even when the VLAN already exists and only the flags are toggled. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-18 11:18:11 -05:00
Jamal Hadi Salim	619fe32640	net_sched fix: reclassification needs to consider ether protocol changes actions could change the etherproto in particular with ethernet tunnelled data. Typically such actions, after peeling the outer header, will ask for the packet to be reclassified. We then need to restart the classification with the new proto header. Example setup used to catch this: sudo tc qdisc add dev $ETH ingress sudo $TC filter add dev $ETH parent ffff: pref 1 protocol 802.1Q \ u32 match u32 0 0 flowid 1:1 \ action vlan pop reclassify Fixes: `3b3ae88026` ("net: sched: consolidate tc_classify{,_compat}") Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-18 11:14:19 -05:00
Colin Ian King	fbbef866fc	net-sysfs: remove unused fmt_long_hex Ever since commit `04ed3e741d` ("net: change netdev->features to u32") the format string fmt_long_hex has not been used, so we may as well remove it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-18 10:01:15 -05:00
Insu Yun	1eea84b74c	tcp: correctly crypto_alloc_hash return check crypto_alloc_hash never returns NULL Signed-off-by: Insu Yun <wuninsu@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-17 22:23:04 -05:00
Zhang Shengju	e4999f256c	vlan: change return type of vlan_proc_rem_dev Since function vlan_proc_rem_dev() will only return 0, it's better to return void instead of int. Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-17 22:06:28 -05:00
Florian Fainelli	73dcb55653	net: dsa: Unregister slave_dev in error path With commit `0071f56e46` ("dsa: Register netdev before phy"), we are now trying to free a network device that has been previously registered, and in case of errors, this will make us hit the BUG_ON(dev->reg_state != NETREG_UNREGISTERED) condition. Fix this by adding a missing unregister_netdev() before free_netdev(). Fixes: `0071f56e46` ("dsa: Register netdev before phy") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-17 22:05:16 -05:00
Arnd Bergmann	f6ca9f46f6	netfilter: ipvs: avoid unused variable warnings The proc_create() and remove_proc_entry() functions do not reference their arguments when CONFIG_PROC_FS is disabled, so we get a couple of warnings about unused variables in IPVS: ipvs/ip_vs_app.c:608:14: warning: unused variable 'net' [-Wunused-variable] ipvs/ip_vs_ctl.c:3950:14: warning: unused variable 'net' [-Wunused-variable] ipvs/ip_vs_ctl.c:3994:14: warning: unused variable 'net' [-Wunused-variable] This removes the local variables and instead looks them up separately for each use, which obviously avoids the warning. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: 4c50a8ce2b63 ("netfilter: ipvs: avoid unused variable warning") Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2016-02-18 09:17:58 +09:00
Yannick Brosseau	2d9e9b0d05	netfilter: ipvs: Remove noisy debug print from ip_vs_del_service This have been there for a long time, but does not seem to add value Signed-off-by: Yannick Brosseau <scientist@fb.com> Signed-off-by: Simon Horman <horms@verge.net.au>	2016-02-18 09:17:58 +09:00
Ben Hutchings	7bbf3cae65	ipv4: Remove inet_lro library There are no longer any in-tree drivers that use it. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-17 16:15:46 -05:00
One Thousand Gnomes	82aaf4fcbe	af_llc: fix types on llc_ui_wait_for_conn The timeout is a long, we return it truncated if it is huge. Basically harmless as the only caller does a boolean check, but tidy it up anyway. (64bit build tested this time. Thank you 0day) Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-17 16:12:13 -05:00
Xin Long	1cd4d5c432	sctp: remove the unused sctp_datamsg_free() Since commit `8b570dc9f7` ("sctp: only drop the reference on the datamsg after sending a msg") used sctp_datamsg_put in sctp_sendmsg, instead of sctp_datamsg_free, this function has no use in sctp. So we will remove it. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-17 15:41:54 -05:00
Xin Long	ac1efde802	sctp: remove rcu_read_lock in sctp_seq_dump_remote_addrs() sctp_seq_dump_remote_addrs is only called by sctp_assocs_seq_show() and it has been protected by rcu_read_lock that is from rhashtable_walk_start(). So we will remove this one. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-17 15:41:54 -05:00
Xin Long	f46c7011b0	sctp: move rcu_read_lock from __sctp_lookup_association to sctp_lookup_association __sctp_lookup_association() is only invoked by sctp_v4_err() and sctp_rcv(), both which run on the rx BH, and it has been protected by rcu_read_lock [see ip_local_deliver_finish() / ipv6_rcv()]. So we can move it to sctp_lookup_association, only let sctp_lookup_association use rcu_read_lock. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-17 15:41:54 -05:00
Mark Tomlinson	853effc55b	l2tp: Fix error creating L2TP tunnels A previous commit (`33f72e6`) added notification via netlink for tunnels when created/modified/deleted. If the notification returned an error, this error was returned from the tunnel function. If there were no listeners, the error code ESRCH was returned, even though having no listeners is not an error. Other calls to this and other similar notification functions either ignore the error code, or filter ESRCH. This patch checks for ESRCH and does not flag this as an error. Reviewed-by: Hamish Martin <hamish.martin@alliedtelesis.co.nz> Signed-off-by: Mark Tomlinson <mark.tomlinson@alliedtelesis.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-17 15:34:47 -05:00
Rosen, Rami	bd4508e850	core: remove unneded headers for net cgroup controllers. commit `3ed80a6` (cgroup: drop module support) made including module.h redundant in the net cgroup controllers, netclassid_cgroup.c and netprio_cgroup.c. This patch removes them. Signed-off-by: Rami Rosen <rami.rosen@intel.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-17 15:31:27 -05:00
Scott Mayhew	437b300c6b	auth_gss: fix panic in gss_pipe_downcall() in fips mode On Mon, 15 Feb 2016, Trond Myklebust wrote: > Hi Scott, > > On Mon, Feb 15, 2016 at 2:28 PM, Scott Mayhew <smayhew@redhat.com> wrote: > > md5 is disabled in fips mode, and attempting to import a gss context > > using md5 while in fips mode will result in crypto_alg_mod_lookup() > > returning -ENOENT, which will make its way back up to > > gss_pipe_downcall(), where the BUG() is triggered. Handling the -ENOENT > > allows for a more graceful failure. > > > > Signed-off-by: Scott Mayhew <smayhew@redhat.com> > > --- > > net/sunrpc/auth_gss/auth_gss.c \| 3 +++ > > 1 file changed, 3 insertions(+) > > > > diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c > > index 799e65b..c30fc3b 100644 > > --- a/net/sunrpc/auth_gss/auth_gss.c > > +++ b/net/sunrpc/auth_gss/auth_gss.c > > @@ -737,6 +737,9 @@ gss_pipe_downcall(struct file filp, const char __user src, size_t mlen) > > case -ENOSYS: > > gss_msg->msg.errno = -EAGAIN; > > break; > > + case -ENOENT: > > + gss_msg->msg.errno = -EPROTONOSUPPORT; > > + break; > > default: > > printk(KERN_CRIT "%s: bad return from " > > "gss_fill_context: %zd\n", __func__, err); > > -- > > 2.4.3 > > > > Well debugged, but I unfortunately do have to ask if this patch is > sufficient? In addition to -ENOENT, and -ENOMEM, it looks to me as if > crypto_alg_mod_lookup() can also fail with -EINTR, -ETIMEDOUT, and > -EAGAIN. Don't we also want to handle those? You're right, I was focusing on the panic that I could easily reproduce. I'm still not sure how I could trigger those other conditions. > > In fact, peering into the rats nest that is > gss_import_sec_context_kerberos(), it looks as if that is just a tiny > subset of all the errors that we might run into. Perhaps the right > thing to do here is to get rid of the BUG() (but keep the above > printk) and just return a generic error? That sounds fine to me -- updated patch attached. -Scott >From d54c6b64a107a90a38cab97577de05f9a4625052 Mon Sep 17 00:00:00 2001 From: Scott Mayhew <smayhew@redhat.com> Date: Mon, 15 Feb 2016 15:12:19 -0500 Subject: [PATCH] auth_gss: remove the BUG() from gss_pipe_downcall() Instead return a generic error via gss_msg->msg.errno. None of the errors returned by gss_fill_context() should necessarily trigger a kernel panic. Signed-off-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2016-02-17 11:50:10 -05:00
Chuck Lever	9f74660bcf	xprtrdma: rpcrdma_bc_receive_call() should init rq_private_buf.len Some NFSv4.1 OPEN requests were hanging waiting for the NFS server to finish recalling delegations. Turns out that each NFSv4.1 CB request on RDMA gets a GARBAGE_ARGS reply from the Linux client. Commit `756b9b37cf` added a line in bc_svc_process that overwrites the incoming rq_rcv_buf's length with the value in rq_private_buf.len. But rpcrdma_bc_receive_call() does not invoke xprt_complete_bc_request(), thus rq_private_buf.len is not initialized. svc_process_common() is invoked with a zero-length RPC message, and fails. Fixes: `756b9b37cf` ('SUNRPC: Fix callback channel') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-02-17 10:23:52 -05:00
John Fastabend	1c78c64e9c	net: add tc offload feature flag Its useful to turn off the qdisc offload feature at a per device level. This gives us a big hammer to enable/disable offloading. More fine grained control (i.e. per rule) may be supported later. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-17 09:47:36 -05:00
John Fastabend	a1b7c5fd7f	net: sched: add cls_u32 offload hooks for netdevs This patch allows netdev drivers to consume cls_u32 offloads via the ndo_setup_tc ndo op. This works aligns with how network drivers have been doing qdisc offloads for mqprio. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-17 09:47:36 -05:00
John Fastabend	16e5cc6471	net: rework setup_tc ndo op to consume general tc operand This patch updates setup_tc so we can pass additional parameters into the ndo op in a generic way. To do this we provide structured union and type flag. This lets each classifier and qdisc provide its own set of attributes without having to add new ndo ops or grow the signature of the callback. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-17 09:47:35 -05:00
John Fastabend	e4c6734eaa	net: rework ndo tc op to consume additional qdisc handle parameter The ndo_setup_tc() op was added to support drivers offloading tx qdiscs however only support for mqprio was ever added. So we only ever added support for passing the number of traffic classes to the driver. This patch generalizes the ndo_setup_tc op so that a handle can be provided to indicate if the offload is for ingress or egress or potentially even child qdiscs. CC: Murali Karicheri <m-karicheri2@ti.com> CC: Shradha Shah <sshah@solarflare.com> CC: Or Gerlitz <ogerlitz@mellanox.com> CC: Ariel Elior <ariel.elior@qlogic.com> CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com> CC: Bruce Allan <bruce.w.allan@intel.com> CC: Jesse Brandeburg <jesse.brandeburg@intel.com> CC: Don Skidmore <donald.c.skidmore@intel.com> Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-17 09:47:35 -05:00
Nikolay Borisov	52a773d645	net: Export ip fragment sysctl to unprivileged users Now that all the ip fragmentation related sysctls are namespaceified there is no reason to hide them anymore from "root" users inside containers. Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-16 20:42:54 -05:00
Nikolay Borisov	0fbf4cb27e	ipv4: namespacify ip fragment max dist sysctl knob Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-16 20:42:54 -05:00
Nikolay Borisov	e21145a987	ipv4: namespacify ip_early_demux sysctl knob Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-16 20:42:54 -05:00
Nikolay Borisov	287b7f38fd	ipv4: Namespacify ip_dynaddr sysctl knob Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-16 20:42:54 -05:00
Nikolay Borisov	dcd87999d4	igmp: net: Move igmp namespace init to correct file When igmp related sysctl were namespacified their initializatin was erroneously put into the tcp socket namespace constructor. This patch moves the relevant code into the igmp namespace constructor to keep things consistent. Also sprinkle some #ifdefs to silence warnings Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-16 20:42:54 -05:00
Nikolay Borisov	fa50d974d1	ipv4: Namespaceify ip_default_ttl sysctl knob Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-16 20:42:54 -05:00
Eric Dumazet	cd9b266095	tcp: add tcpi_min_rtt and tcpi_notsent_bytes to tcp_info tcpi_min_rtt reports the minimal rtt observed by TCP stack for the flow, in usec unit. Might be ~0U if not yet known. tcpi_notsent_bytes reports the amount of bytes in the write queue that were not yet sent. This is done in a single patch to not add a temporary 32bit padding hole in tcp_info. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-16 20:27:35 -05:00
Eric Dumazet	729235554d	tcp: md5: release request socket instead of listener If tcp_v4_inbound_md5_hash() returns an error, we must release the refcount on the request socket, not on the listener. The bug was added for IPv4 only. Fixes: `079096f103` ("tcp/dccp: install syn_recv requests into ehash table") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-16 20:24:06 -05:00
Paolo Abeni	3c1cb4d260	net/ipv4: add dst cache support for gre lwtunnels In case of UDP traffic with datagram length below MTU this gives about 4% performance increase Signed-off-by: Paolo Abeni <pabeni@redhat.com> Suggested-and-Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-16 20:21:49 -05:00
Paolo Abeni	d71785ffc7	net: add dst_cache to ovs vxlan lwtunnel In case of UDP traffic with datagram length below MTU this give about 2% performance increase when tunneling over ipv4 and about 60% when tunneling over ipv6 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-16 20:21:48 -05:00
Paolo Abeni	e09acddf87	ip_tunnel: replace dst_cache with generic implementation The current ip_tunnel cache implementation is prone to a race that will cause the wrong dst to be cached on cuncurrent dst cache miss and ip tunnel update via netlink. Replacing with the generic implementation fix the issue. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-16 20:21:48 -05:00
Paolo Abeni	607f725f6f	net: replace dst_cache ip6_tunnel implementation with the generic one This also fix a potential race into the existing tunnel code, which could lead to the wrong dst to be permanenty cached: CPU1: CPU2: <xmit on ip6_tunnel> <cache lookup fails> dst = ip6_route_output(...) <tunnel params are changed via nl> dst_cache_reset() // no effect, // the cache is empty dst_cache_set() // the wrong dst // is permanenty stored // into the cache With the new dst implementation the above race is not possible since the first cache lookup after dst_cache_reset will fail due to the timestamp check Signed-off-by: Paolo Abeni <pabeni@redhat.com> Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-16 20:21:48 -05:00
Paolo Abeni	911362c70d	net: add dst_cache support This patch add a generic, lockless dst cache implementation. The need for lock is avoided updating the dst cache fields only in per cpu scope, and requiring that the cache manipulation functions are invoked with the local bh disabled. The refresh_ts and reset_ts fields are used to ensure the cache consistency in case of cuncurrent cache update (dst_cache_set*) and reset operation (dst_cache_reset). Consider the following scenario: CPU1: CPU2: <cache lookup with emtpy cache: it fails> <get dst via uncached route lookup> <related configuration changes> dst_cache_reset() dst_cache_set() The dst entry set passed to dst_cache_set() should not be used for later dst cache lookup, because it's obtained using old configuration values. Since the refresh_ts is updated only on dst_cache lookup, the cached value in the above scenario will be discarded on the next lookup. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-16 20:21:48 -05:00
Eric Dumazet	372022830b	tcp: do not set rtt_min to 1 There are some cases where rtt_us derives from deltas of jiffies, instead of using usec timestamps. Since we want to track minimal rtt, better to assume a delta of 0 jiffie might be in fact be very close to 1 jiffie. It is kind of sad jiffies_to_usecs(1) calls a function instead of simply using a constant. Fixes: `f672258391` ("tcp: track min RTT using windowed min-filter") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-16 16:07:13 -05:00
Sascha Hauer	a407054f83	net: dsa: remove phy_disconnect from error path The phy has not been initialized, disconnecting it in the error path results in a NULL pointer exception. Drop the phy_disconnect from the error path. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Acked-by: Neil Armstrong <narmstrong@baylibre.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-16 16:04:08 -05:00
Richard Alpe	4952cd3e7b	tipc: refactor node xmit and fix memory leaks Refactor tipc_node_xmit() to fail fast and fail early. Fix several potential memory leaks in unexpected error paths. Reported-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-16 15:58:40 -05:00
Jon Paul Maloy	d5c91fb72f	tipc: fix premature addition of node to lookup table In commit `5266698661` ("tipc: let broadcast packet reception use new link receive function") we introduced a new per-node broadcast reception link instance. This link is created at the moment the node itself is created. Unfortunately, the allocation is done after the node instance has already been added to the node lookup hash table. This creates a potential race condition, where arriving broadcast packets are able to find and access the node before it has been fully initialized, and before the above mentioned link has been created. The result is occasional crashes in the function tipc_bcast_rcv(), which is trying to access the not-yet existing link. We fix this by deferring the addition of the node instance until after it has been fully initialized in the function tipc_node_create(). Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-16 15:57:11 -05:00
Arnd Bergmann	56bb7fd994	bridge: mdb: avoid uninitialized variable warning A recent change to the mdb code confused the compiler to the point where it did not realize that the port-group returned from br_mdb_add_group() is always valid when the function returns a nonzero return value, so we get a spurious warning: net/bridge/br_mdb.c: In function 'br_mdb_add': net/bridge/br_mdb.c:542:4: error: 'pg' may be used uninitialized in this function [-Werror=maybe-uninitialized] __br_mdb_notify(dev, entry, RTM_NEWMDB, pg); Slightly rearranging the code in br_mdb_add_group() makes the problem go away, as gcc is clever enough to see that both functions check for 'ret != 0'. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: `9e8430f8d6` ("bridge: mdb: Passing the port-group pointer to br_mdb module") Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-16 15:37:28 -05:00
Alexander Duyck	78565208d7	net: Copy inner L3 and L4 headers as unaligned on GRE TEB This patch corrects the unaligned accesses seen on GRE TEB tunnels when generating hash keys. Specifically what this patch does is make it so that we force the use of skb_copy_bits when the GRE inner headers will be unaligned due to NET_IP_ALIGNED being a non-zero value. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-16 15:25:01 -05:00
Keller, Jacob E	8bf3686204	ethtool: ensure channel counts are within bounds during SCHANNELS Add a sanity check to ensure that all requested channel sizes are within bounds, which should reduce errors in driver implementation. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-16 15:19:49 -05:00
Keller, Jacob E	d4ab428627	ethtool: correctly ensure {GS}CHANNELS doesn't conflict with GS{RXFH} Ethernet drivers implementing both {GS}RXFH and {GS}CHANNELS ethtool ops incorrectly allow SCHANNELS when it would conflict with the settings from SRXFH. This occurs because it is not possible for drivers to understand whether their Rx flow indirection table has been configured or is in the default state. In addition, drivers currently behave in various ways when increasing the number of Rx channels. Some drivers will always destroy the Rx flow indirection table when this occurs, whether it has been set by the user or not. Other drivers will attempt to preserve the table even if the user has never modified it from the default driver settings. Neither of these situation is desirable because it leads to unexpected behavior or loss of user configuration. The correct behavior is to simply return -EINVAL when SCHANNELS would conflict with the current Rx flow table settings. However, it should only do so if the current settings were modified by the user. If we required that the new settings never conflict with the current (default) Rx flow settings, we would force users to first reduce their Rx flow settings and then reduce the number of Rx channels. This patch proposes a solution implemented in net/core/ethtool.c which ensures that all drivers behave correctly. It checks whether the RXFH table has been configured to non-default settings, and stores this information in a private netdev flag. When the number of channels is requested to change, it first ensures that the current Rx flow table is not going to assign flows to now disabled channels. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-16 15:19:49 -05:00
David S. Miller	dba36b382e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contain a rather large batch for your net that includes accumulated bugfixes, they are: 1) Run conntrack cleanup from workqueue process context to avoid hitting soft lockup via watchdog for large tables. This is required by the IPv6 masquerading extension. From Florian Westphal. 2) Use original skbuff from nfnetlink batch when calling netlink_ack() on error since this needs to access the skb->sk pointer. 3) Incremental fix on top of recent Sasha Levin's lock fix for conntrack resizing. 4) Fix several problems in nfnetlink batch message header sanitization and error handling, from Phil Turnbull. 5) Select NF_DUP_IPV6 based on CONFIG_IPV6, from Arnd Bergmann. 6) Fix wrong signess in return values on nf_tables counter expression, from Anton Protopopov. Due to the NetDev 1.1 organization burden, I had no chance to pass up this to you any sooner in this release cycle, sorry about that. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-16 12:56:00 -05:00
Rainer Weikusat	a5527dda34	af_unix: Guard against other == sk in unix_dgram_sendmsg The unix_dgram_sendmsg routine use the following test if (unlikely(unix_peer(other) != sk && unix_recvq_full(other))) { to determine if sk and other are in an n:1 association (either established via connect or by using sendto to send messages to an unrelated socket identified by address). This isn't correct as the specified address could have been bound to the sending socket itself or because this socket could have been connected to itself by the time of the unix_peer_get but disconnected before the unix_state_lock(other). In both cases, the if-block would be entered despite other == sk which might either block the sender unintentionally or lead to trying to unlock the same spin lock twice for a non-blocking send. Add a other != sk check to guard against this. Fixes: `7d267278a9` ("unix: avoid use-after-free in ep_remove_wait_queue") Reported-By: Philipp Hahn <pmhahn@pmhahn.de> Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com> Tested-by: Philipp Hahn <pmhahn@pmhahn.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-16 12:53:35 -05:00
Rainer Weikusat	1b92ee3d03	af_unix: Don't set err in unix_stream_read_generic unless there was an error The present unix_stream_read_generic contains various code sequences of the form err = -EDISASTER; if (<test>) goto out; This has the unfortunate side effect of possibly causing the error code to bleed through to the final out: return copied ? : err; and then to be wrongly returned if no data was copied because the caller didn't supply a data buffer, as demonstrated by the program available at http://pad.lv/1540731 Change it such that err is only set if an error condition was detected. Fixes: `3822b5c2fc` ("af_unix: Revert 'lock_interruptible' in stream receive code") Reported-by: Joseph Salisbury <joseph.salisbury@canonical.com> Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-16 12:48:04 -05:00
Andrew Lunn	1bc4e2b000	batman-adv: Avoid endless loop in bat-on-bat netdevice check batman-adv checks in different situation if a new device is already on top of a different batman-adv device. This is done by getting the iflink of a device and all its parent. It assumes that this iflink is always a parent device in an acyclic graph. But this assumption is broken by devices like veth which are actually a pair of two devices linked to each other. The recursive check would therefore get veth0 when calling dev_get_iflink on veth1. And it gets veth0 when calling dev_get_iflink with veth1. Creating a veth pair and loading batman-adv freezes parts of the system ip link add veth0 type veth peer name veth1 modprobe batman-adv An RCU stall will be detected on the system which cannot be fixed. INFO: rcu_sched self-detected stall on CPU 1: (5264 ticks this GP) idle=3e9/140000000000001/0 softirq=144683/144686 fqs=5249 (t=5250 jiffies g=46 c=45 q=43) Task dump for CPU 1: insmod R running task 0 247 245 0x00000008 ffffffff8151f140 ffffffff8107888e ffff88000fd141c0 ffffffff8151f140 0000000000000000 ffffffff81552df0 ffffffff8107b420 0000000000000001 ffff88000e3fa700 ffffffff81540b00 ffffffff8107d667 0000000000000001 Call Trace: <IRQ> [<ffffffff8107888e>] ? rcu_dump_cpu_stacks+0x7e/0xd0 [<ffffffff8107b420>] ? rcu_check_callbacks+0x3f0/0x6b0 [<ffffffff8107d667>] ? hrtimer_run_queues+0x47/0x180 [<ffffffff8107cf9d>] ? update_process_times+0x2d/0x50 [<ffffffff810873fb>] ? tick_handle_periodic+0x1b/0x60 [<ffffffff810290ae>] ? smp_trace_apic_timer_interrupt+0x5e/0x90 [<ffffffff813bbae2>] ? apic_timer_interrupt+0x82/0x90 <EOI> [<ffffffff812c3fd7>] ? __dev_get_by_index+0x37/0x40 [<ffffffffa0031f3e>] ? batadv_hard_if_event+0xee/0x3a0 [batman_adv] [<ffffffff812c5801>] ? register_netdevice_notifier+0x81/0x1a0 [...] This can be avoided by checking if two devices are each others parent and stopping the check in this situation. Fixes: `b7eddd0b39` ("batman-adv: prevent using any virtual device created on batman-adv as hard-interface") Signed-off-by: Andrew Lunn <andrew@lunn.ch> [sven@narfation.org: rewritten description, extracted fix] Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-16 22:16:33 +08:00
Sven Eckelmann	3db152093e	batman-adv: Only put orig_node_vlan list reference when removed The batadv_orig_node_vlan reference counter in batadv_tt_global_size_mod can only be reduced when the list entry was actually removed. Otherwise the reference counter may reach zero when batadv_tt_global_size_mod is called from two different contexts for the same orig_node_vlan but only one context is actually removing the entry from the list. The release function for this orig_node_vlan is not called inside the vlan_list_lock spinlock protected region because the function batadv_tt_global_size_mod still holds a orig_node_vlan reference for the object pointer on the stack. Thus the actual release function (when required) will be called only at the end of the function. Fixes: `7ea7b4a142` ("batman-adv: make the TT CRC logic VLAN specific") Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-16 17:52:26 +08:00
Sven Eckelmann	c18bdd018e	batman-adv: Only put gw_node list reference when removed The batadv_gw_node reference counter in batadv_gw_node_update can only be reduced when the list entry was actually removed. Otherwise the reference counter may reach zero when batadv_gw_node_update is called from two different contexts for the same gw_node but only one context is actually removing the entry from the list. The release function for this gw_node is not called inside the list_lock spinlock protected region because the function batadv_gw_node_update still holds a gw_node reference for the object pointer on the stack. Thus the actual release function (when required) will be called only at the end of the function. Fixes: `bd3524c14b` ("batman-adv: remove obsolete deleted attribute for gateway node") Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-16 17:52:25 +08:00
Laura Abbott	5988818008	vsock: Fix blocking ops call in prepare_to_wait We receoved a bug report from someone using vmware: WARNING: CPU: 3 PID: 660 at kernel/sched/core.c:7389 __might_sleep+0x7d/0x90() do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff810fa68d>] prepare_to_wait+0x2d/0x90 Modules linked in: vmw_vsock_vmci_transport vsock snd_seq_midi snd_seq_midi_event snd_ens1371 iosf_mbi gameport snd_rawmidi snd_ac97_codec ac97_bus snd_seq coretemp snd_seq_device snd_pcm snd_timer snd soundcore ppdev crct10dif_pclmul crc32_pclmul ghash_clmulni_intel vmw_vmci vmw_balloon i2c_piix4 shpchp parport_pc parport acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc btrfs xor raid6_pq 8021q garp stp llc mrp crc32c_intel serio_raw mptspi vmwgfx drm_kms_helper ttm drm scsi_transport_spi mptscsih e1000 ata_generic mptbase pata_acpi CPU: 3 PID: 660 Comm: vmtoolsd Not tainted 4.2.0-0.rc1.git3.1.fc23.x86_64 #1 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014 0000000000000000 0000000049e617f3 ffff88006ac37ac8 ffffffff818641f5 0000000000000000 ffff88006ac37b20 ffff88006ac37b08 ffffffff810ab446 ffff880068009f40 ffffffff81c63bc0 0000000000000061 0000000000000000 Call Trace: [<ffffffff818641f5>] dump_stack+0x4c/0x65 [<ffffffff810ab446>] warn_slowpath_common+0x86/0xc0 [<ffffffff810ab4d5>] warn_slowpath_fmt+0x55/0x70 [<ffffffff8112551d>] ? debug_lockdep_rcu_enabled+0x1d/0x20 [<ffffffff810fa68d>] ? prepare_to_wait+0x2d/0x90 [<ffffffff810fa68d>] ? prepare_to_wait+0x2d/0x90 [<ffffffff810da2bd>] __might_sleep+0x7d/0x90 [<ffffffff812163b3>] __might_fault+0x43/0xa0 [<ffffffff81430477>] copy_from_iter+0x87/0x2a0 [<ffffffffa039460a>] __qp_memcpy_to_queue+0x9a/0x1b0 [vmw_vmci] [<ffffffffa0394740>] ? qp_memcpy_to_queue+0x20/0x20 [vmw_vmci] [<ffffffffa0394757>] qp_memcpy_to_queue_iov+0x17/0x20 [vmw_vmci] [<ffffffffa0394d50>] qp_enqueue_locked+0xa0/0x140 [vmw_vmci] [<ffffffffa039593f>] vmci_qpair_enquev+0x4f/0xd0 [vmw_vmci] [<ffffffffa04847bb>] vmci_transport_stream_enqueue+0x1b/0x20 [vmw_vsock_vmci_transport] [<ffffffffa047ae05>] vsock_stream_sendmsg+0x2c5/0x320 [vsock] [<ffffffff810fabd0>] ? wake_atomic_t_function+0x70/0x70 [<ffffffff81702af8>] sock_sendmsg+0x38/0x50 [<ffffffff81702ff4>] SYSC_sendto+0x104/0x190 [<ffffffff8126e25a>] ? vfs_read+0x8a/0x140 [<ffffffff817042ee>] SyS_sendto+0xe/0x10 [<ffffffff8186d9ae>] entry_SYSCALL_64_fastpath+0x12/0x76 transport->stream_enqueue may call copy_to_user so it should not be called inside a prepare_to_wait. Narrow the scope of the prepare_to_wait to avoid the bad call. This also applies to vsock_stream_recvmsg as well. Reported-by: Vinson Lee <vlee@freedesktop.org> Tested-by: Vinson Lee <vlee@freedesktop.org> Signed-off-by: Laura Abbott <labbott@fedoraproject.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-13 05:57:39 -05:00
Eric Dumazet	919483096b	ipv4: fix memory leaks in ip_cmsg_send() callers Dmitry reported memory leaks of IP options allocated in ip_cmsg_send() when/if this function returns an error. Callers are responsible for the freeing. Many thanks to Dmitry for the report and diagnostic. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-13 05:57:39 -05:00
Edward Cree	6fa79666e2	net: ip_tunnel: remove 'csum_help' argument to iptunnel_handle_offloads All users now pass false, so we can remove it, and remove the code that was conditional upon it. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-12 05:52:16 -05:00
Edward Cree	53936107ba	net: gre: Implement LCO for GRE over IPv4 Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-12 05:52:16 -05:00
Edward Cree	06f622926d	fou: enable LCO in FOU and GUE Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-12 05:52:16 -05:00
Edward Cree	d75f1306d9	net: udp: always set up for CHECKSUM_PARTIAL offload If the dst device doesn't support it, it'll get fixed up later anyway by validate_xmit_skb(). Also, this allows us to take advantage of LCO to avoid summing the payload multiple times. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-12 05:52:15 -05:00
Edward Cree	179bc67f69	net: local checksum offload for encapsulation The arithmetic properties of the ones-complement checksum mean that a correctly checksummed inner packet, including its checksum, has a ones complement sum depending only on whatever value was used to initialise the checksum field before checksumming (in the case of TCP and UDP, this is the ones complement sum of the pseudo header, complemented). Consequently, if we are going to offload the inner checksum with CHECKSUM_PARTIAL, we can compute the outer checksum based only on the packed data not covered by the inner checksum, and the initial value of the inner checksum field. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-12 05:52:15 -05:00
Eric Dumazet	ea8add2b19	tcp/dccp: better use of ephemeral ports in bind() Implement strategy used in __inet_hash_connect() in opposite way : Try to find a candidate using odd ports, then fallback to even ports. We no longer disable BH for whole traversal, but one bucket at a time. We also use cond_resched() to yield cpu to other tasks if needed. I removed one indentation level and tried to mirror the loop we have in __inet_hash_connect() and variable names to ease code maintenance. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-12 05:28:32 -05:00
Eric Dumazet	1580ab63fc	tcp/dccp: better use of ephemeral ports in connect() In commit `07f4c90062` ("tcp/dccp: try to not exhaust ip_local_port_range in connect()"), I added a very simple heuristic, so that we got better chances to use even ports, and allow bind() users to have more available slots. It gave nice results, but with more than 200,000 TCP sessions on a typical server, the ~30,000 ephemeral ports are still a rare resource. I chose to go a step further, by looking at all even ports, and if none was available, fallback to odd ports. The companion patch does the same in bind(), but in opposite way. I've seen exec times of up to 30ms on busy servers, so I no longer disable BH for the whole traversal, but only for each hash bucket. I also call cond_resched() to be gentle to other tasks. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-12 05:28:32 -05:00
Linus Torvalds	5de6ac75d9	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: 1) Fix BPF handling of branch offset adjustmnets on backjumps, from Daniel Borkmann. 2) Make sure selinux knows about SOCK_DESTROY netlink messages, from Lorenzo Colitti. 3) Fix openvswitch tunnel mtu regression, from David Wragg. 4) Fix ICMP handling of TCP sockets in syn_recv state, from Eric Dumazet. 5) Fix SCTP user hmacid byte ordering bug, from Xin Long. 6) Fix recursive locking in ipv6 addrconf, from Subash Abhinov Kasiviswanathan. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: bpf: fix branch offset adjustment on backjumps after patching ctx expansion vxlan, gre, geneve: Set a large MTU on ovs-created tunnel devices geneve: Relax MTU constraints vxlan: Relax MTU constraints flow_dissector: Fix unaligned access in __skb_flow_dissector when used by eth_get_headlen of: of_mdio: Add marvell, 88e1145 to whitelist of PHY compatibilities. selinux: nlmsgtab: add SOCK_DESTROY to the netlink mapping tables sctp: translate network order to host order when users get a hmacid enic: increment devcmd2 result ring in case of timeout tg3: Fix for tg3 transmit queue 0 timed out when too many gso_segs net:Add sysctl_max_skb_frags tcp: do not drop syn_recv on all icmp reports ipv6: fix a lockdep splat unix: correctly track in-flight fds in sending process user_struct update be2net maintainers' email addresses dwc_eth_qos: Reset hardware before PHY start ipv6: addrconf: Fix recursive spin lock call	2016-02-11 11:00:34 -08:00
Jesper Dangaard Brouer	15fad714be	net: bulk free SKBs that were delay free'ed due to IRQ context The network stack defers SKBs free, in-case free happens in IRQ or when IRQs are disabled. This happens in __dev_kfree_skb_irq() that writes SKBs that were free'ed during IRQ to the softirq completion queue (softnet_data.completion_queue). These SKBs are naturally delayed, and cleaned up during NET_TX_SOFTIRQ in function net_tx_action(). Take advantage of this a use the skb defer and flush API, as we are already in softirq context. For modern drivers this rarely happens. Although most drivers do call dev_kfree_skb_any(), which detects the situation and calls __dev_kfree_skb_irq() when needed. This due to netpoll can call from IRQ context. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-11 11:59:09 -05:00
Jesper Dangaard Brouer	795bb1c00d	net: bulk free infrastructure for NAPI context, use napi_consume_skb Discovered that network stack were hitting the kmem_cache/SLUB slowpath when freeing SKBs. Doing bulk free with kmem_cache_free_bulk can speedup this slowpath. NAPI context is a bit special, lets take advantage of that for bulk free'ing SKBs. In NAPI context we are running in softirq, which gives us certain protection. A softirq can run on several CPUs at once. BUT the important part is a softirq will never preempt another softirq running on the same CPU. This gives us the opportunity to access per-cpu variables in softirq context. Extend napi_alloc_cache (before only contained page_frag_cache) to be a struct with a small array based stack for holding SKBs. Introduce a SKB defer and flush API for accessing this. Introduce napi_consume_skb() as replacement for e.g. dev_consume_skb_any() when running in NAPI context. A small trick to handle/detect if we are called from netpoll is to see if budget is 0. In that case, we need to invoke dev_consume_skb_irq(). Joint work with Alexander Duyck. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-11 11:59:09 -05:00
Nikolay Borisov	165094afce	igmp: Namespacify igmp_qrv sysctl knob Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-11 09:59:22 -05:00
Nikolay Borisov	87a8a2ae65	igmp: Namespaceify igmp_llm_reports sysctl knob This was initially introduced in `df2cf4a78e` ("IGMP: Inhibit reports for local multicast groups") by defining the sysctl in the ipv4_net_table array, however it was never implemented to be namespace aware. Fix this by changing the code accordingly. Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-11 09:59:22 -05:00
Nikolay Borisov	166b6b2d6f	igmp: Namespaceify igmp_max_msf sysctl knob Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-11 09:59:22 -05:00
Nikolay Borisov	815c527007	igmp: Namespaceify igmp_max_memberships sysctl knob Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-11 09:59:22 -05:00
Tycho Andersen	4a92602aa1	openvswitch: allow management from inside user namespaces Operations with the GENL_ADMIN_PERM flag fail permissions checks because this flag means we call netlink_capable, which uses the init user ns. Instead, let's introduce a new flag, GENL_UNS_ADMIN_PERM for operations which should be allowed inside a user namespace. The motivation for this is to be able to run openvswitch in unprivileged containers. I've tested this and it seems to work, but I really have no idea about the security consequences of this patch, so thoughts would be much appreciated. v2: use the GENL_UNS_ADMIN_PERM flag instead of a check in each function v3: use separate ifs for UNS_ADMIN_PERM and ADMIN_PERM, instead of one massive one Reported-by: James Page <james.page@canonical.com> Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com> CC: Eric Biederman <ebiederm@xmission.com> CC: Pravin Shelar <pshelar@ovn.org> CC: Justin Pettit <jpettit@nicira.com> CC: "David S. Miller" <davem@davemloft.net> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-11 09:53:19 -05:00
stephen hemminger	f48e72318a	rds: duplicate include net/tcp.h Duplicate include detected. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-11 09:45:24 -05:00
Alexander Duyck	f245d079c1	net: Allow tunnels to use inner checksum offloads with outer checksums needed This patch enables us to use inner checksum offloads if provided by hardware with outer checksums computed by software. It basically reduces encap_hdr_csum to an advisory flag for now, but based on the fact that SCTP may be getting segmentation support before long I thought we may want to keep it as it is possible we may need to support CRC32c and 1's compliment checksum in the same packet at some point in the future. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-11 08:55:34 -05:00
Alexander Duyck	dbef491ebe	udp: Use uh->len instead of skb->len to compute checksum in segmentation The segmentation code was having to do a bunch of work to pull the skb->len and strip the udp header offset before the value could be used to adjust the checksum. Instead of doing all this work we can just use the value that goes into uh->len since that is the correct value with the correct byte order that we need anyway. By using this value we can save ourselves a bunch of pain as there is no need to do multiple byte swaps. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-11 08:55:34 -05:00
Alexander Duyck	fdaefd62fd	udp: Clean up the use of flags in UDP segmentation offload This patch goes though and cleans up the logic related to several of the control flags used in UDP segmentation. Specifically the use of dont_encap isn't really needed as we can just check the skb for CHECKSUM_PARTIAL and if it isn't set then we don't need to update the internal headers. As such we can just drop that value. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-11 08:55:34 -05:00
Alexander Duyck	3872035241	gre: Use inner_proto to obtain inner header protocol Instead of parsing headers to determine the inner protocol we can just pull the value from inner_proto. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-11 08:55:34 -05:00
Alexander Duyck	2e598af713	gre: Use GSO flags to determine csum need instead of GRE flags This patch updates the gre checksum path to follow something much closer to the UDP checksum path. By doing this we can avoid needing to do as much header inspection and can just make use of the fields we were already reading in the sk_buff structure. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-11 08:55:34 -05:00
Alexander Duyck	ddff00d420	net: Move skb_has_shared_frag check out of GRE code and into segmentation The call skb_has_shared_frag is used in the GRE path and skb_checksum_help to verify that no frags can be modified by an external entity. This check really doesn't belong in the GRE path but in the skb_segment function itself. This way any protocol that might be segmented will be performing this check before attempting to offload a checksum to software. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-11 08:55:34 -05:00
Alexander Duyck	08b64fcca9	net: Store checksum result for offloaded GSO checksums This patch makes it so that we can offload the checksums for a packet up to a certain point and then begin computing the checksums via software. Setting this up is fairly straight forward as all we need to do is reset the values stored in csum and csum_start for the GSO context block. One complication for this is remote checksum offload. In order to allow the inner checksums to be offloaded while computing the outer checksum manually we needed to have some way of indicating that the offload wasn't real. In order to do that I replaced CHECKSUM_PARTIAL with CHECKSUM_UNNECESSARY in the case of us computing checksums for the outer header while skipping computing checksums for the inner headers. We clean up the ip_summed flag and set it to either CHECKSUM_PARTIAL or CHECKSUM_NONE once we hand the packet off to the next lower level. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-11 08:55:33 -05:00
Alexander Duyck	7fbeffed77	net: Update remote checksum segmentation to support use of GSO checksum This patch addresses two main issues. First in the case of remote checksum offload we were avoiding dealing with scatter-gather issues. As a result it would be possible to assemble a series of frames that used frags instead of being linearized as they should have if remote checksum offload was enabled. Second I have updated the code so that we now let GSO take care of doing the checksum on the data itself and drop the special case that was added for remote checksum offload. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-11 08:55:33 -05:00
Alexander Duyck	7644345622	net: Move GSO csum into SKB_GSO_CB This patch moves the checksum maintained by GSO out of skb->csum and into the GSO context block in order to allow for us to work on outer checksums while maintaining the inner checksum offsets in the case of the inner checksum being offloaded, while the outer checksums will be computed. While updating the code I also did a minor cleanu-up on gso_make_checksum. The change is mostly to make it so that we store the values and compute the checksum instead of computing the checksum and then storing the values we needed to update. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-11 08:55:33 -05:00
Alexander Duyck	bef3c6c937	net: Drop unecessary enc_features variable from tunnel segmentation functions The enc_features variable isn't necessary since features isn't used anywhere after we create enc_features so instead just use a destructive AND on features itself and save ourselves the variable declaration. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-11 08:55:33 -05:00
Johannes Berg	7a02bf892d	ipv6: add option to drop unsolicited neighbor advertisements In certain 802.11 wireless deployments, there will be NA proxies that use knowledge of the network to correctly answer requests. To prevent unsolicitd advertisements on the shared medium from being a problem, on such deployments wireless needs to drop them. Enable this by providing an option called "drop_unsolicited_na". Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-11 04:27:36 -05:00
Johannes Berg	abbc30436d	ipv6: add option to drop unicast encapsulated in L2 multicast In order to solve a problem with 802.11, the so-called hole-196 attack, add an option (sysctl) called "drop_unicast_in_l2_multicast" which, if enabled, causes the stack to drop IPv6 unicast packets encapsulated in link-layer multi- or broadcast frames. Such frames can (as an attack) be created by any member of the same wireless network and transmitted as valid encrypted frames since the symmetric key for broadcast frames is shared between all stations. Reviewed-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-11 04:27:36 -05:00
Johannes Berg	97daf33145	ipv4: add option to drop gratuitous ARP packets In certain 802.11 wireless deployments, there will be ARP proxies that use knowledge of the network to correctly answer requests. To prevent gratuitous ARP frames on the shared medium from being a problem, on such deployments wireless needs to drop them. Enable this by providing an option called "drop_gratuitous_arp". Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-11 04:27:35 -05:00
Johannes Berg	12b74dfadb	ipv4: add option to drop unicast encapsulated in L2 multicast In order to solve a problem with 802.11, the so-called hole-196 attack, add an option (sysctl) called "drop_unicast_in_l2_multicast" which, if enabled, causes the stack to drop IPv4 unicast packets encapsulated in link-layer multi- or broadcast frames. Such frames can (as an attack) be created by any member of the same wireless network and transmitted as valid encrypted frames since the symmetric key for broadcast frames is shared between all stations. Additionally, enabling this option provides compliance with a SHOULD clause of RFC 1122. Reviewed-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-11 04:27:35 -05:00
David Ahern	dc599f76c2	net: Add support for filtering link dump by master device and kind Add support for filtering link dumps by master device and kind, similar to the filtering implemented for neighbor dumps. Each net_device that exists adds between 1196 bytes (eth) and 1556 bytes (bridge) to the link dump. As the number of interfaces increases so does the amount of data pushed to user space for a link list. If the user only wants to see a list of specific devices (e.g., interfaces enslaved to a specific bridge or a list of VRFs) most of that data is thrown away. Passing the filters to the kernel to have only relevant data returned makes the dump more efficient. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-11 04:18:26 -05:00
Craig Gallek	c125e80b88	soreuseport: fast reuseport TCP socket selection This change extends the fast SO_REUSEPORT socket lookup implemented for UDP to TCP. Listener sockets with SO_REUSEPORT and the same receive address are additionally added to an array for faster random access. This means that only a single socket from the group must be found in the listener list before any socket in the group can be used to receive a packet. Previously, every socket in the group needed to be considered before handing off the incoming packet. This feature also exposes the ability to use a BPF program when selecting a socket from a reuseport group. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-11 03:54:15 -05:00
Craig Gallek	fa46349767	soreuseport: Prep for fast reuseport TCP socket selection Both of the lines in this patch probably should have been included in the initial implementation of this code for generic socket support, but weren't technically necessary since only UDP sockets were supported. First, the sk_reuseport_cb points to a structure which assumes each socket in the group has this pointer assigned at the same time it's added to the array in the structure. The sk_clone_lock function breaks this assumption. Since a child socket shouldn't implicitly be in a reuseport group, the simple fix is to clear the field in the clone. Second, the SO_ATTACH_REUSEPORT_xBPF socket options require that SO_REUSEPORT also be set first. For UDP sockets, this is easily enforced at bind-time since that process both puts the socket in the appropriate receive hlist and updates the reuseport structures. Since these operations can happen at two different times for TCP sockets (bind and listen) it must be explicitly checked to enforce the use of SO_REUSEPORT with SO_ATTACH_REUSEPORT_xBPF in the setsockopt call. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-11 03:54:15 -05:00
Craig Gallek	a583636a83	inet: refactor inet[6]_lookup functions to take skb This is a preliminary step to allow fast socket lookup of SO_REUSEPORT groups. Doing so with a BPF filter will require access to the skb in question. This change plumbs the skb (and offset to payload data) through the call stack to the listening socket lookup implementations where it will be used in a following patch. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-11 03:54:14 -05:00
Craig Gallek	496611d7b5	inet: create IPv6-equivalent inet_hash function In order to support fast lookups for TCP sockets with SO_REUSEPORT, the function that adds sockets to the listening hash set needs to be able to check receive address equality. Since this equality check is different for IPv4 and IPv6, we will need two different socket hashing functions. This patch adds inet6_hash identical to the existing inet_hash function and updates the appropriate references. A following patch will differentiate the two by passing different comparison functions to __inet_hash. Additionally, in order to use the IPv6 address equality function from inet6_hashtables (which is compiled as a built-in object when IPv6 is enabled) it also needs to be in a built-in object file as well. This moves ipv6_rcv_saddr_equal into inet_hashtables to accomplish this. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-11 03:54:14 -05:00
Craig Gallek	086c653f58	sock: struct proto hash function may error In order to support fast reuseport lookups in TCP, the hash function defined in struct proto must be capable of returning an error code. This patch changes the function signature of all related hash functions to return an integer and handles or propagates this return value at all call sites. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-11 03:54:14 -05:00
Sven Eckelmann	92dcdf09a1	batman-adv: Convert batadv_tt_common_entry to kref batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-10 23:24:06 +08:00
Sven Eckelmann	7c12439115	batman-adv: Convert batadv_orig_node to kref batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-10 23:24:06 +08:00
Sven Eckelmann	161a3be932	batman-adv: Convert batadv_orig_node_vlan to kref batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-10 23:24:05 +08:00
Sven Eckelmann	7a659d5694	batman-adv: Convert batadv_hard_iface to kref batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-10 23:24:05 +08:00
Sven Eckelmann	77ae32e898	batman-adv: Convert batadv_neigh_node to kref batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-10 23:24:04 +08:00
Sven Eckelmann	a6ba0d340d	batman-adv: Convert batadv_orig_ifinfo to kref batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-10 23:24:04 +08:00
Sven Eckelmann	962c68328b	batman-adv: Convert batadv_neigh_ifinfo to kref batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-10 23:24:03 +08:00
Sven Eckelmann	6e8ef69dd4	batman-adv: Convert batadv_tt_orig_list_entry to kref batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-10 23:24:03 +08:00
Sven Eckelmann	32836f56f8	batman-adv: Convert batadv_tvlv_handler to kref batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-10 23:24:03 +08:00
Sven Eckelmann	f7157dd135	batman-adv: Convert batadv_tvlv_container to kref batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-10 23:24:02 +08:00
Sven Eckelmann	68a6722cc4	batman-adv: Convert batadv_dat_entry to kref batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-10 23:24:02 +08:00
Sven Eckelmann	727e0cd59e	batman-adv: Convert batadv_nc_path to kref batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-10 23:24:01 +08:00
Sven Eckelmann	daf99b4810	batman-adv: Convert batadv_nc_node to kref batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-10 23:24:01 +08:00
Sven Eckelmann	71b7e3d316	batman-adv: Convert batadv_bla_claim to kref batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-10 23:24:00 +08:00
Sven Eckelmann	06e56ded86	batman-adv: Convert batadv_bla_backbone_gw to kref batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-10 23:24:00 +08:00
Sven Eckelmann	6be4d30c18	batman-adv: Convert batadv_softif_vlan to kref batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-10 23:23:59 +08:00
Sven Eckelmann	e7aed321b8	batman-adv: Convert batadv_gw_node to kref batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-10 23:23:59 +08:00
Sven Eckelmann	90f564dff4	batman-adv: Convert batadv_hardif_neigh_node to kref batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-10 23:23:58 +08:00
Sven Eckelmann	dded069224	batman-adv: Add lockdep assert for container_list_lock The batadv_tvlv_container* functions state in their kernel-doc that they require tvlv.container_list_lock. Add an assert to automatically detect when this might have been ignored by the caller. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-10 23:23:58 +08:00
Simon Wunderlich	81f0268350	batman-adv: add seqno maximum age and protection start flag parameters To allow future use of the window protected function with different maximum sequence numbers, add a parameter to set this value which was previously hardcoded. Another parameter added for future use is a flag to return whether the protection window has started. While at it, also fix the kerneldoc. Signed-off-by: Simon Wunderlich <simon@open-mesh.com> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-10 23:23:57 +08:00
Sven Eckelmann	140ed8e87c	batman-adv: Drop reference to netdevice on last reference The references to the network device should be dropped inside the release function for batadv_hard_iface similar to what is done with the batman-adv internal datastructures. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>	2016-02-10 23:23:57 +08:00
David Wragg	7e059158d5	vxlan, gre, geneve: Set a large MTU on ovs-created tunnel devices Prior to 4.3, openvswitch tunnel vports (vxlan, gre and geneve) could transmit vxlan packets of any size, constrained only by the ability to send out the resulting packets. 4.3 introduced netdevs corresponding to tunnel vports. These netdevs have an MTU, which limits the size of a packet that can be successfully encapsulated. The default MTU values are low (1500 or less), which is awkwardly small in the context of physical networks supporting jumbo frames, and leads to a conspicuous change in behaviour for userspace. Instead, set the MTU on openvswitch-created netdevs to be the relevant maximum (i.e. the maximum IP packet size minus any relevant overhead), effectively restoring the behaviour prior to 4.3. Signed-off-by: David Wragg <david@weave.works> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-10 05:50:03 -05:00
Alexander Duyck	461547f315	flow_dissector: Fix unaligned access in __skb_flow_dissector when used by eth_get_headlen This patch fixes an issue with unaligned accesses when using eth_get_headlen on a page that was DMA aligned instead of being IP aligned. The fact is when trying to check the length we don't need to be looking at the flow label so we can reorder the checks to first check if we are supposed to gather the flow label and then make the call to actually get it. v2: Updated path so that either STOP_AT_FLOW_LABEL or KEY_FLOW_LABEL can cause us to check for the flow label. Reported-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-09 07:07:48 -05:00
Willem de Bruijn	1d036d25e5	packet: tpacket_snd gso and checksum offload Support socket option PACKET_VNET_HDR together with PACKET_TX_RING. When enabled, a struct virtio_net_hdr is expected to precede the data in the ring. The vnet option must be set before the ring is created. The implementation reuses the existing skb_copy_bits code that is used when dev->hard_header_len is non-zero. Move this ll_header check to before the skb alloc and combine it with a test for vnet_hdr->hdr_len. Allocate and copy the max of the two. Verified with test program at github.com/wdebruij/kerneltools/blob/master/tests/psock_txring_vnet.c Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-09 06:43:50 -05:00
Willem de Bruijn	8d39b4a6b8	packet: parse tpacket header before skb alloc GSO packet headers must be stored in the linear skb segment. Move tpacket header parsing before sock_alloc_send_skb. The GSO follow-on patch will later increase the skb linear argument to sock_alloc_send_skb if needed for large packets. The header parsing code does not require an allocated skb, so is safe to move. Later pass to tpacket_fill_skb the computed data start and length. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-09 06:43:50 -05:00
Willem de Bruijn	58d19b19cd	packet: vnet_hdr support for tpacket_rcv Support socket option PACKET_VNET_HDR together with PACKET_RX_RING. When enabled, a struct virtio_net_hdr will precede the data in the packet ring slots. Verified with test program at github.com/wdebruij/kerneltools/blob/master/tests/psock_rxring_vnet.c pkt: 1454269209.798420 len=5066 vnet: gso_type=tcpv4 gso_size=1448 hlen=66 ecn=off csum: start=34 off=16 eth: proto=0x800 ip: src=<masked> dst=<masked> proto=6 len=5052 Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-09 06:43:50 -05:00
Willem de Bruijn	16cc140045	packet: move vnet_hdr code to helper functions packet_snd and packet_rcv support virtio net headers for GSO. Move this logic into helper functions to be able to reuse it in tpacket_snd and tpacket_rcv. This is a straighforward code move with one exception. Instead of creating and passing a separate gso_type variable, reuse vnet_hdr.gso_type after conversion from virtio to kernel gso type. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-09 06:43:50 -05:00
Xin Long	7a84bd4664	sctp: translate network order to host order when users get a hmacid Commit `ed5a377d87` ("sctp: translate host order to network order when setting a hmacid") corrected the hmacid byte-order when setting a hmacid. but the same issue also exists on getting a hmacid. We fix it by changing hmacids to host order when users get them with getsockopt. Fixes: Commit `ed5a377d87` ("sctp: translate host order to network order when setting a hmacid") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-09 04:53:16 -05:00
Elad Raz	9e8430f8d6	bridge: mdb: Passing the port-group pointer to br_mdb module Passing the port-group to br_mdb in order to allow direct access to the structure. br_mdb will later use the structure to reflect HW reflection status via "state" variable. Signed-off-by: Elad Raz <eladr@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-09 04:42:47 -05:00
Elad Raz	9d06b6d8a3	bridge: mdb: Separate br_mdb_entry->state from net_bridge_port_group->state Change net_bridge_port_group 'state' member to 'flags' and define new set of flags internal to the kernel. Signed-off-by: Elad Raz <eladr@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-09 04:42:47 -05:00
Hans Westgaard Ry	5f74f82ea3	net:Add sysctl_max_skb_frags Devices may have limits on the number of fragments in an skb they support. Current codebase uses a constant as maximum for number of fragments one skb can hold and use. When enabling scatter/gather and running traffic with many small messages the codebase uses the maximum number of fragments and may thereby violate the max for certain devices. The patch introduces a global variable as max number of fragments. Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com> Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-09 04:28:06 -05:00
Eric Dumazet	9cf7490360	tcp: do not drop syn_recv on all icmp reports Petr Novopashenniy reported that ICMP redirects on SYN_RECV sockets were leading to RST. This is of course incorrect. A specific list of ICMP messages should be able to drop a SYN_RECV. For instance, a REDIRECT on SYN_RECV shall be ignored, as we do not hold a dst per SYN_RECV pseudo request. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=111751 Fixes: `079096f103` ("tcp/dccp: install syn_recv requests into ehash table") Reported-by: Petr Novopashenniy <pety@rusnet.ru> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-09 04:15:37 -05:00
Eric Dumazet	44c3d0c1c0	ipv6: fix a lockdep splat Silence lockdep false positive about rcu_dereference() being used in the wrong context. First one should use rcu_dereference_protected() as we own the spinlock. Second one should be a normal assignation, as no barrier is needed. Fixes: `18367681a1` ("ipv6 flowlabel: Convert np->ipv6_fl_list to RCU.") Reported-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-08 10:33:32 -05:00
Hannes Frederic Sowa	415e3d3e90	unix: correctly track in-flight fds in sending process user_struct The commit referenced in the Fixes tag incorrectly accounted the number of in-flight fds over a unix domain socket to the original opener of the file-descriptor. This allows another process to arbitrary deplete the original file-openers resource limit for the maximum of open files. Instead the sending processes and its struct cred should be credited. To do so, we add a reference counted struct user_struct pointer to the scm_fp_list and use it to account for the number of inflight unix fds. Fixes: `712f4aad40` ("unix: properly account for FDs passed over unix sockets") Reported-by: David Herrmann <dh.herrmann@gmail.com> Cc: David Herrmann <dh.herrmann@gmail.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Linus Torvalds <torvalds@linux-foundation.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-08 10:30:42 -05:00
Anton Protopopov	5cc6ce9ff2	netfilter: nft_counter: fix erroneous return values The nft_counter_init() and nft_counter_clone() functions should return negative error value -ENOMEM instead of positive ENOMEM. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-02-08 13:05:02 +01:00
Arnd Bergmann	08a7f5d3f5	netfilter: tee: select NF_DUP_IPV6 unconditionally The NETFILTER_XT_TARGET_TEE option selects NF_DUP_IPV6 whenever IP6_NF_IPTABLES is enabled, and it ensures that it cannot be builtin itself if NF_CONNTRACK is a loadable module, as that is a dependency for NF_DUP_IPV6. However, NF_DUP_IPV6 can be enabled even if IP6_NF_IPTABLES is turned off, and it only really depends on IPV6. With the current check in tee_tg6, we call nf_dup_ipv6() whenever NF_DUP_IPV6 is enabled. This can however be a loadable module which is unreachable from a built-in xt_TEE: net/built-in.o: In function `tee_tg6': :(.text+0x67728): undefined reference to `nf_dup_ipv6' The bug was originally introduced in the split of the xt_TEE module into separate modules for ipv4 and ipv6, and two patches tried to fix it unsuccessfully afterwards. This is a revert of the the first incorrect attempt to fix it, going back to depending on IPV6 as the dependency, and we adapt the 'select' condition accordingly. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: `bbde9fc182` ("netfilter: factor out packet duplication for IPv4/IPv6") Fixes: `116984a316` ("netfilter: xt_TEE: use IS_ENABLED(CONFIG_NF_DUP_IPV6)") Fixes: `74ec4d55c4` ("netfilter: fix xt_TEE and xt_TPROXY dependencies") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-02-08 12:58:28 +01:00
Phil Turnbull	c58d6c9368	netfilter: nfnetlink: correctly validate length of batch messages If nlh->nlmsg_len is zero then an infinite loop is triggered because 'skb_pull(skb, msglen);' pulls zero bytes. The calculation in nlmsg_len() underflows if 'nlh->nlmsg_len < NLMSG_HDRLEN' which bypasses the length validation and will later trigger an out-of-bound read. If the length validation does fail then the malformed batch message is copied back to userspace. However, we cannot do this because the nlh->nlmsg_len can be invalid. This leads to an out-of-bounds read in netlink_ack: [ 41.455421] ================================================================== [ 41.456431] BUG: KASAN: slab-out-of-bounds in memcpy+0x1d/0x40 at addr ffff880119e79340 [ 41.456431] Read of size 4294967280 by task a.out/987 [ 41.456431] ============================================================================= [ 41.456431] BUG kmalloc-512 (Not tainted): kasan: bad access detected [ 41.456431] ----------------------------------------------------------------------------- ... [ 41.456431] Bytes b4 ffff880119e79310: 00 00 00 00 d5 03 00 00 b0 fb fe ff 00 00 00 00 ................ [ 41.456431] Object ffff880119e79320: 20 00 00 00 10 00 05 00 00 00 00 00 00 00 00 00 ............... [ 41.456431] Object ffff880119e79330: 14 00 0a 00 01 03 fc 40 45 56 11 22 33 10 00 05 .......@EV."3... [ 41.456431] Object ffff880119e79340: f0 ff ff ff 88 99 aa bb 00 14 00 0a 00 06 fe fb ................ ^^ start of batch nlmsg with nlmsg_len=4294967280 ... [ 41.456431] Memory state around the buggy address: [ 41.456431] ffff880119e79400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 41.456431] ffff880119e79480: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 41.456431] >ffff880119e79500: 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc [ 41.456431] ^ [ 41.456431] ffff880119e79580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 41.456431] ffff880119e79600: fc fc fc fc fc fc fc fc fc fc fb fb fb fb fb fb [ 41.456431] ================================================================== Fix this with better validation of nlh->nlmsg_len and by setting NFNL_BATCH_FAILURE if any batch message fails length validation. CAP_NET_ADMIN is required to trigger the bugs. Fixes: `9ea2aa8b7d` ("netfilter: nfnetlink: validate nfnetlink header from batch") Signed-off-by: Phil Turnbull <phil.turnbull@oracle.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-02-08 12:56:54 +01:00
David S. Miller	0aca737d46	tcp: Fix syncookies sysctl default. Unintentionally the default was changed to zero, fix that. Fixes: `12ed8244ed` ("ipv4: Namespaceify tcp syncookies sysctl knob") Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-08 04:24:33 -05:00
Nikolay Borisov	4979f2d9f7	ipv4: Namespaceify tcp_notsent_lowat sysctl knob Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-07 14:36:11 -05:00
Nikolay Borisov	1e579caa18	ipv4: Namespaceify tcp_fin_timeout sysctl knob Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-07 14:36:11 -05:00
Nikolay Borisov	c402d9beff	ipv4: Namespaceify tcp_orphan_retries sysctl knob Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-07 14:35:11 -05:00
Nikolay Borisov	c6214a97c8	ipv4: Namespaceify tcp_retries2 sysctl knob Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-07 14:35:11 -05:00
Nikolay Borisov	ae5c3f406c	ipv4: Namespaceify tcp_retries1 sysctl knob Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-07 14:35:10 -05:00
Nikolay Borisov	1043e25ff9	ipv4: Namespaceify tcp reordering sysctl knob Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-07 14:35:10 -05:00
Nikolay Borisov	12ed8244ed	ipv4: Namespaceify tcp syncookies sysctl knob Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-07 14:35:10 -05:00
Nikolay Borisov	7c083ecb3b	ipv4: Namespaceify tcp synack retries sysctl knob Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-07 14:35:10 -05:00
Nikolay Borisov	6fa2516630	ipv4: Namespaceify tcp syn retries sysctl knob Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-07 14:35:10 -05:00
David S. Miller	9d1eb21b59	This batch of patches includes a number of corrections and improvements for our kernel-doc. These changes also make sure that our doc is now properly processed by the kernel-doc parsing tool. Other than that you have a patch updating all the copyright lines to 2016 and another patch switching the URLs in our readme, Kconfig and MAINTAINERS file from "http" to "https". Both by Sven Eckelmann. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJWsV4cAAoJENpFlCjNi1MRXloP/2tpnEdIABtAzZWvdUeE7eLP 7yZGhociLYYWRbGc/uiKeHBr9UvZy+Abo0qJDrW/10BHuFb2SmIqZwBINEhIiLAm +jpwzGKWvbeXPU/fQ9EfhG9CR7U2ceeI6c/0F8eAAofTzDZSEu2JTk43/wzXhh46 LuKKHhvqZ3T+S7LLrMm0uJmwPBGJwdFYEWgvG18WB/1VEJTsrYipM/tbinpEKfiw QaGYXClQmOx/n0yo46/H5lR8VB4RfCF8Gtx6+5teMcYXtVO6j6lnacgoztt5P7Ku 6j/9+r/5xXmFMO/D+xkE/QprbYtomB+R4t8c2FaQZxae4FN5sIEggFphFj5XwQ4L OpuQWFH0UhJLaxFjq54lO0tzMICy6tpQIy0EstgKoa91VOOlyniS0bb3up9/tP3O nWpPbgVz+xB3KbAdUT0cPcN0TePGJxLC0v1Gli1KjwiVkU0kC+ADfXj54b+aNn5T N5aTbWFiJyW/9p+hDEh+/NBqg56v40qFri7KcEBNWsV6qI2R9pmKyj68oyV/GVYS wYi+6wIKjwW4x6Gk1AqeZsJ+nj5QVnzYOTf1XT1zGQjwseMJxUhHW2/ct2Fae7PM 9CHZVEd129BdE66NRlWzjjfefMjNlIFqMKxLGyqEKQp5ClWDdfsHtiC78ry1fHWR n4ymPVPZXOTQ+plyObr2 =IfQN -----END PGP SIGNATURE----- Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge Antonio Quartulli says: ==================== This batch of patches includes a number of corrections and improvements for our kernel-doc. These changes also make sure that our doc is now properly processed by the kernel-doc parsing tool. Other than that you have a patch updating all the copyright lines to 2016 and another patch switching the URLs in our readme, Kconfig and MAINTAINERS file from "http" to "https". Both by Sven Eckelmann. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-07 14:32:50 -05:00
Yuchung Cheng	d452e6caf8	tcp: tcp_cong_control helper Refactor and consolidate cwnd and rate updates into a new function tcp_cong_control(). Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-07 14:09:51 -05:00
Yuchung Cheng	2d14a4def4	tcp: make congestion control more robust against reordering This change enables congestion control to update cwnd based on not only packet cumulatively acked but also packets delivered out-of-order. This makes congestion control robust against packet reordering because it may raise cwnd as long as packets are being delivered once reordering has been detected (i.e., it only cares the amount of packets delivered, not the ordering among them). Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-07 14:09:51 -05:00
Yuchung Cheng	3ebd887105	tcp: refactor pkts acked accounting A small refactoring that gets number of packets cumulatively acked from tcp_clean_rtx_queue() directly. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-07 14:09:51 -05:00
Yuchung Cheng	ddf1af6fa0	tcp: new delivery accounting This patch changes the accounting of how many packets are newly acked or sacked when the sender receives an ACK. The current approach basically computes newly_acked_sacked = (prior_packets - prior_sacked) - (tp->packets_out - tp->sacked_out) where prior_packets and prior_sacked out are snapshot at the beginning of the ACK processing. The new approach tracks the delivery information via a new TCP state variable "delivered" which monotically increases as new packets are delivered in order or out-of-order. The reason for this change is that the current approach is brittle that produces negative or inaccurate estimate. 1) For non-SACK connections, an ACK that advances the SND.UNA could reset the DUPACK counters (tp->sacked_out) in tcp_process_loss() or tcp_fastretrans_alert(). This inflates the inflight suddenly and causes under-estimate or even negative estimate. Here is a real example: before after (processing ACK) packets_out 75 73 sacked_out 23 0 ca state Loss Open The old approach computes (75-23) - (73 - 0) = -21 delivered while the new approach computes 1 delivered since it considers the 2nd-24th packets are delivered OOO. 2) MSS change would re-count packets_out and sacked_out so the estimate is in-accurate and can even become negative. E.g., the inflight is doubled when MSS is halved. 3) Spurious retransmission signaled by DSACK is not accounted The new approach is simpler and more robust. For SACK connections, tp->delivered increments as packets are being acked or sacked in SACK and ACK processing. For non-sack connections, it's done in tcp_remove_reno_sacks() and tcp_add_reno_sack(). When an ACK advances the SND.UNA, tp->delivered is incremented by the number of packets ACKed (less the current number of DUPACKs received plus one packet hole). Upon receiving a DUPACK, tp->delivered is incremented assuming one out-of-order packet is delivered. Upon receiving a DSACK, tp->delivered is incremtened assuming one retransmission is delivered in tcp_sacktag_write_queue(). Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-07 14:09:51 -05:00
Yuchung Cheng	31ba0c1072	tcp: move cwnd reduction after recovery state procesing Currently the cwnd is reduced and increased in various different places. The reduction happens in various places in the recovery state processing (tcp_fastretrans_alert) while the increase happens afterward. A better sequence is to identify lost packets and update the congestion control state (icsk_ca_state) first. Then base on the new state, up/down the cwnd in one central place. It's more clear to reason cwnd changes. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-07 14:09:50 -05:00
Yuchung Cheng	e662ca40de	tcp: retransmit after recovery processing and congestion control The retransmission and F-RTO transmission currently happen inside recovery state processing (tcp_fastretrans_alert) but before congestion control. This refactoring moves the logic after both s.t. we can determine how much to send (cwnd) before deciding what to send. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-07 14:09:50 -05:00
David Herrmann	3575dbf2cb	net: drop write-only stack variable Remove a write-only stack variable from unix_attach_fds(). This is a left-over from the security fix in: commit `712f4aad40` Author: willy tarreau <w@1wt.eu> Date: Sun Jan 10 07:54:56 2016 +0100 unix: properly account for FDs passed over unix sockets Signed-off-by: David Herrmann <dh.herrmann@gmail.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-07 14:06:26 -05:00
Eric Dumazet	e3e17b773b	tcp: fastopen: call tcp_fin() if FIN present in SYNACK When we acknowledge a FIN, it is not enough to ack the sequence number and queue the skb into receive queue. We also have to call tcp_fin() to properly update socket state and send proper poll() notifications. It seems we also had the problem if we received a SYN packet with the FIN flag set, but it does not seem an urgent issue, as no known implementation can do that. Fixes: `61d2bcae99` ("tcp: fastopen: accept data/FIN present in SYNACK message") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-06 16:49:58 -05:00
Parthasarathy Bhuvaragan	06c8581f85	tipc: use alloc_ordered_workqueue() instead of WQ_UNBOUND w/ max_active = 1 Until now, tipc_rcv and tipc_send workqueues in server are allocated with parameters WQ_UNBOUND & max_active = 1. This parameters passed to this function makes it equivalent to alloc_ordered_workqueue(). The later form is more explicit and can inherit future ordered_workqueue changes. In this commit we replace alloc_workqueue() with more readable alloc_ordered_workqueue(). Acked-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-06 03:41:58 -05:00
Parthasarathy Bhuvaragan	ae245557f8	tipc: donot create timers if subscription timeout = TIPC_WAIT_FOREVER Until now, we create timers even for the subscription requests with timeout = TIPC_WAIT_FOREVER. This can be improved by avoiding timer creation when the timeout is set to TIPC_WAIT_FOREVER. In this commit, we introduce a check to creates timers only when timeout != TIPC_WAIT_FOREVER. Acked-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-06 03:41:58 -05:00
Parthasarathy Bhuvaragan	f3ad288c56	tipc: protect tipc_subscrb_get() with subscriber spin lock Until now, during subscription creation the mod_time() & tipc_subscrb_get() are called after releasing the subscriber spin lock. In a SMP system when performing a subscription creation, if the subscription timeout occurs simultaneously (the timer is scheduled to run on another CPU) then the timer thread might decrement the subscribers refcount before the create thread increments the refcount. This can be simulated by creating subscription with timeout=0 and sometimes the timeout occurs before the create request is complete. This leads to the following message: [30.702949] BUG: spinlock bad magic on CPU#1, kworker/u8:3/87 [30.703834] general protection fault: 0000 [#1] SMP [30.704826] CPU: 1 PID: 87 Comm: kworker/u8:3 Not tainted 4.4.0-rc8+ #18 [30.704826] Workqueue: tipc_rcv tipc_recv_work [tipc] [30.704826] task: ffff88003f878600 ti: ffff88003fae0000 task.ti: ffff88003fae0000 [30.704826] RIP: 0010:[<ffffffff8109196c>] [<ffffffff8109196c>] spin_dump+0x5c/0xe0 [...] [30.704826] Call Trace: [30.704826] [<ffffffff81091a16>] spin_bug+0x26/0x30 [30.704826] [<ffffffff81091b75>] do_raw_spin_lock+0xe5/0x120 [30.704826] [<ffffffff81684439>] _raw_spin_lock_bh+0x19/0x20 [30.704826] [<ffffffffa0096f10>] tipc_subscrb_rcv_cb+0x1d0/0x330 [tipc] [30.704826] [<ffffffffa00a37b1>] tipc_receive_from_sock+0xc1/0x150 [tipc] [30.704826] [<ffffffffa00a31df>] tipc_recv_work+0x3f/0x80 [tipc] [30.704826] [<ffffffff8106a739>] process_one_work+0x149/0x3c0 [30.704826] [<ffffffff8106aa16>] worker_thread+0x66/0x460 [30.704826] [<ffffffff8106a9b0>] ? process_one_work+0x3c0/0x3c0 [30.704826] [<ffffffff8106a9b0>] ? process_one_work+0x3c0/0x3c0 [30.704826] [<ffffffff8107029d>] kthread+0xed/0x110 [30.704826] [<ffffffff810701b0>] ? kthread_create_on_node+0x190/0x190 [30.704826] [<ffffffff81684bdf>] ret_from_fork+0x3f/0x70 In this commit, 1. we remove the check for the return code for mod_timer() 2. we protect tipc_subscrb_get() using the subscriber spin lock. We increment the subscriber's refcount as soon as we add the subscription to subscriber's subscription list. Acked-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-06 03:41:58 -05:00
Parthasarathy Bhuvaragan	d4091899c9	tipc: hold subscriber->lock for tipc_nametbl_subscribe() Until now, while creating a subscription the subscriber lock protects only the subscribers subscription list and not the nametable. The call to tipc_nametbl_subscribe() is outside the lock. However, at subscription timeout and cancel both the subscribers subscription list and the nametable are protected by the subscriber lock. This asymmetric locking mechanism leads to the following problem: In a SMP system, the timer can be fire on another core before the create request is complete. When the timer thread calls tipc_nametbl_unsubscribe() before create thread calls tipc_nametbl_subscribe(), we get a nullptr exception. This can be simulated by creating subscription with timeout=0 and sometimes the timeout occurs before the create request is complete. The following is the oops: [57.569661] BUG: unable to handle kernel NULL pointer dereference at (null) [57.577498] IP: [<ffffffffa02135aa>] tipc_nametbl_unsubscribe+0x8a/0x120 [tipc] [57.584820] PGD 0 [57.586834] Oops: 0002 [#1] SMP [57.685506] CPU: 14 PID: 10077 Comm: kworker/u40:1 Tainted: P OENX 3.12.48-52.27.1. 9688.1.PTF-default #1 [57.703637] Workqueue: tipc_rcv tipc_recv_work [tipc] [57.708697] task: ffff88064c7f00c0 ti: ffff880629ef4000 task.ti: ffff880629ef4000 [57.716181] RIP: 0010:[<ffffffffa02135aa>] [<ffffffffa02135aa>] tipc_nametbl_unsubscribe+0x8a/ 0x120 [tipc] [...] [57.812327] Call Trace: [57.814806] [<ffffffffa0211c77>] tipc_subscrp_delete+0x37/0x90 [tipc] [57.821357] [<ffffffffa0211e2f>] tipc_subscrp_timeout+0x3f/0x70 [tipc] [57.827982] [<ffffffff810618c1>] call_timer_fn+0x31/0x100 [57.833490] [<ffffffff81062709>] run_timer_softirq+0x1f9/0x2b0 [57.839414] [<ffffffff8105a795>] __do_softirq+0xe5/0x230 [57.844827] [<ffffffff81520d1c>] call_softirq+0x1c/0x30 [57.850150] [<ffffffff81004665>] do_softirq+0x55/0x90 [57.855285] [<ffffffff8105aa35>] irq_exit+0x95/0xa0 [57.860290] [<ffffffff815215b5>] smp_apic_timer_interrupt+0x45/0x60 [57.866644] [<ffffffff8152005d>] apic_timer_interrupt+0x6d/0x80 [57.872686] [<ffffffffa02121c5>] tipc_subscrb_rcv_cb+0x2a5/0x3f0 [tipc] [57.879425] [<ffffffffa021c65f>] tipc_receive_from_sock+0x9f/0x100 [tipc] [57.886324] [<ffffffffa021c826>] tipc_recv_work+0x26/0x60 [tipc] [57.892463] [<ffffffff8106fb22>] process_one_work+0x172/0x420 [57.898309] [<ffffffff8107079a>] worker_thread+0x11a/0x3c0 [57.903871] [<ffffffff81077114>] kthread+0xb4/0xc0 [57.908751] [<ffffffff8151f318>] ret_from_fork+0x58/0x90 In this commit, we do the following at subscription creation: 1. set the subscription's subscriber pointer before performing tipc_nametbl_subscribe(), as this value is required further in the call chain ex: by tipc_subscrp_send_event(). 2. move tipc_nametbl_subscribe() under the scope of subscriber lock Acked-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-06 03:41:58 -05:00
Parthasarathy Bhuvaragan	cb01c7c870	tipc: fix connection abort when receiving invalid cancel request Until now, the subscribers endianness for a subscription create/cancel request is determined as: swap = !(s->filter & (TIPC_SUB_PORTS \| TIPC_SUB_SERVICE)) The checks are performed only for port/service subscriptions. The swap calculation is incorrect if the filter in the subscription cancellation request is set to TIPC_SUB_CANCEL (it's a malformed cancel request, as the corresponding subscription create filter is missing). Thus, the check if the request is for cancellation fails and the request is treated as a subscription create request. The subscription creation fails as the request is illegal, which terminates this connection. In this commit we determine the endianness by including TIPC_SUB_CANCEL, which will set swap correctly and the request is processed as a cancellation request. Acked-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-06 03:41:58 -05:00
Parthasarathy Bhuvaragan	c8beccc67c	tipc: fix connection abort during subscription cancellation In 'commit `7fe8097cef` ("tipc: fix nullpointer bug when subscribing to events")', we terminate the connection if the subscription creation fails. In the same commit, the subscription creation result was based on the value of subscription pointer (set in the function) instead of the return code. Unfortunately, the same function also handles subscription cancellation request. For a subscription cancellation request, the subscription pointer cannot be set. Thus the connection is terminated during cancellation request. In this commit, we move the subcription cancel check outside of tipc_subscrp_create(). Hence, - tipc_subscrp_create() will create a subscripton - tipc_subscrb_rcv_cb() will subscribe or cancel a subscription. Fixes: 'commit `7fe8097cef` ("tipc: fix nullpointer bug when subscribing to events")' Acked-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-06 03:41:58 -05:00
Parthasarathy Bhuvaragan	7c13c62241	tipc: introduce tipc_subscrb_subscribe() routine In this commit, we split tipc_subscrp_create() into two: 1. tipc_subscrp_create() creates a subscription 2. A new function tipc_subscrp_subscribe() adds the subscription to the subscriber subscription list, activates the subscription timer and subscribes to the nametable updates. In future commits, the purpose of tipc_subscrb_rcv_cb() will be to either subscribe or cancel a subscription. There is no functional change in this commit. Acked-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-06 03:41:57 -05:00
Parthasarathy Bhuvaragan	a4273c73eb	tipc: remove struct tipc_name_seq from struct tipc_subscription Until now, struct tipc_subscriber has duplicate fields for type, upper and lower (as member of struct tipc_name_seq) at: 1. as member seq in struct tipc_subscription 2. as member seq in struct tipc_subscr, which is contained in struct tipc_event The former structure contains the type, upper and lower values in network byte order and the later contains the intact copy of the request. The struct tipc_subscription contains a field swap to determine if request needs network byte order conversion. Thus by using swap, we can convert the request when required instead of duplicating it. In this commit, 1. we remove the references to these elements as members of struct tipc_subscription and replace them with elements from struct tipc_subscr. 2. provide new functions to convert the user request into network byte order. Acked-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-06 03:40:43 -05:00
Parthasarathy Bhuvaragan	3086523149	tipc: remove filter and timeout elements from struct tipc_subscription Until now, struct tipc_subscription has duplicate timeout and filter attributes present: 1. directly as members of struct tipc_subscription 2. in struct tipc_subscr, which is contained in struct tipc_event In this commit, we remove the references to these elements as members of struct tipc_subscription and replace them with elements from struct tipc_subscr. Acked-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-06 03:40:43 -05:00
Parthasarathy Bhuvaragan	4f61d4ef70	tipc: remove incorrect check for subscription timeout value Until now, during subscription creation we set sub->timeout by converting the timeout request value in milliseconds to jiffies. This is followed by setting the timeout value in the timer if sub->timeout != TIPC_WAIT_FOREVER. For a subscription create request with a timeout value of TIPC_WAIT_FOREVER, msecs_to_jiffies(TIPC_WAIT_FOREVER) returns MAX_JIFFY_OFFSET (0xfffffffe). This is not equal to TIPC_WAIT_FOREVER (0xffffffff). In this commit, we remove this check. Acked-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-06 03:40:43 -05:00
Kim Jones	ba905f5e2f	ethtool: Declare netdev_rss_key as __read_mostly. netdev_rss_key is written to once and thereafter is read by drivers when they are initialising. The fact that it is mostly read and not written to makes it a candidate for a __read_mostly declaration. Signed-off-by: Kim Jones <kim-marie.jones@intel.com> Signed-off-by: Alan Carey <alan.carey@intel.com> Acked-by: Rami Rosen <rami.rosen@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-06 03:13:49 -05:00
Eric Dumazet	9d691539ee	tcp: do not enqueue skb with SYN flag If we remove the SYN flag from the skbs that tcp_fastopen_add_skb() places in socket receive queue, then we can remove the test that tcp_recvmsg() has to perform in fast path. All we have to do is to adjust SEQ in the slow path. For the moment, we place an unlikely() and output a message if we find an skb having SYN flag set. Goal would be to get rid of the test completely. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-06 03:11:59 -05:00
Eric Dumazet	61d2bcae99	tcp: fastopen: accept data/FIN present in SYNACK message RFC 7413 (TCP Fast Open) 4.2.2 states that the SYNACK message MAY include data and/or FIN This patch adds support for the client side : If we receive a SYNACK with payload or FIN, queue the skb instead of ignoring it. Since we already support the same for SYN, we refactor the existing code and reuse it. Note we need to clone the skb, so this operation might fail under memory pressure. Sara Dickinson pointed out FreeBSD server Fast Open implementation was planned to generate such SYNACK in the future. The server side might be implemented on linux later. Reported-by: Sara Dickinson <sara@sinodun.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-06 03:11:59 -05:00

... 3 4 5 6 7 ...

41173 Коммитов