v2:
* Replace busy wait with wait_event()/wake_up_all()
* Cannot garantee that at the time xennet_remove is called, the
xen_netback state will not be XenbusStateClosed, so added a
condition for that
* There's a small chance for the xen_netback state is
XenbusStateUnknown by the time the xen_netfront switches to Closed,
so added a condition for that.
When unloading module xen_netfront from guest, dmesg would output
warning messages like below:
[ 105.236836] xen:grant_table: WARNING: g.e. 0x903 still in use!
[ 105.236839] deferring g.e. 0x903 (pfn 0x35805)
This problem relies on netfront and netback being out of sync. By the time
netfront revokes the g.e.'s netback didn't have enough time to free all of
them, hence displaying the warnings on dmesg.
The trick here is to make netfront to wait until netback frees all the g.e.'s
and only then continue to cleanup for the module removal, and this is done by
manipulating both device states.
Signed-off-by: Eduardo Otubo <otubo@redhat.com>
Acked-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC791 specifies the minimum MTU to be 68, while xen-net{front|back}
drivers use a minimum value of 0.
When set MTU to 0~67 with xen_net{front|back} driver, the network
will become unreachable immediately, the guest can no longer be pinged.
xen_net{front|back} should not allow the user to set this value which causes
network problems.
Reported-by: Chen Shi <cheshi@redhat.com>
Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
xennet_start_xmit() might copy skb with inappropriate layout
into a fresh one.
Old skb is freed, and at this point it is not a drop, but
a consume. New skb will then be either consumed or dropped.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unavoidable crashes in netfront_resume() and netback_changed() after a
previous fail in talk_to_netback() (e.g. when we fail to read MAC from
xenstore) were discovered. The failure path in talk_to_netback() does
unregister/free for netdev but we don't reset drvdata and we try accessing
it after resume.
Fix the bug by removing the whole xen device completely with
device_unregister(), this guarantees we won't have any calls into netfront
after a failure.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
"Highlights:
1) Support TX_RING in AF_PACKET TPACKET_V3 mode, from Sowmini
Varadhan.
2) Simplify classifier state on sk_buff in order to shrink it a bit.
From Willem de Bruijn.
3) Introduce SIPHASH and it's usage for secure sequence numbers and
syncookies. From Jason A. Donenfeld.
4) Reduce CPU usage for ICMP replies we are going to limit or
suppress, from Jesper Dangaard Brouer.
5) Introduce Shared Memory Communications socket layer, from Ursula
Braun.
6) Add RACK loss detection and allow it to actually trigger fast
recovery instead of just assisting after other algorithms have
triggered it. From Yuchung Cheng.
7) Add xmit_more and BQL support to mvneta driver, from Simon Guinot.
8) skb_cow_data avoidance in esp4 and esp6, from Steffen Klassert.
9) Export MPLS packet stats via netlink, from Robert Shearman.
10) Significantly improve inet port bind conflict handling, especially
when an application is restarted and changes it's setting of
reuseport. From Josef Bacik.
11) Implement TX batching in vhost_net, from Jason Wang.
12) Extend the dummy device so that VF (virtual function) features,
such as configuration, can be more easily tested. From Phil
Sutter.
13) Avoid two atomic ops per page on x86 in bnx2x driver, from Eric
Dumazet.
14) Add new bpf MAP, implementing a longest prefix match trie. From
Daniel Mack.
15) Packet sample offloading support in mlxsw driver, from Yotam Gigi.
16) Add new aquantia driver, from David VomLehn.
17) Add bpf tracepoints, from Daniel Borkmann.
18) Add support for port mirroring to b53 and bcm_sf2 drivers, from
Florian Fainelli.
19) Remove custom busy polling in many drivers, it is done in the core
networking since 4.5 times. From Eric Dumazet.
20) Support XDP adjust_head in virtio_net, from John Fastabend.
21) Fix several major holes in neighbour entry confirmation, from
Julian Anastasov.
22) Add XDP support to bnxt_en driver, from Michael Chan.
23) VXLAN offloads for enic driver, from Govindarajulu Varadarajan.
24) Add IPVTAP driver (IP-VLAN based tap driver) from Sainath Grandhi.
25) Support GRO in IPSEC protocols, from Steffen Klassert"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1764 commits)
Revert "ath10k: Search SMBIOS for OEM board file extension"
net: socket: fix recvmmsg not returning error from sock_error
bnxt_en: use eth_hw_addr_random()
bpf: fix unlocking of jited image when module ronx not set
arch: add ARCH_HAS_SET_MEMORY config
net: napi_watchdog() can use napi_schedule_irqoff()
tcp: Revert "tcp: tcp_probe: use spin_lock_bh()"
net/hsr: use eth_hw_addr_random()
net: mvpp2: enable building on 64-bit platforms
net: mvpp2: switch to build_skb() in the RX path
net: mvpp2: simplify MVPP2_PRS_RI_* definitions
net: mvpp2: fix indentation of MVPP2_EXT_GLOBAL_CTRL_DEFAULT
net: mvpp2: remove unused register definitions
net: mvpp2: simplify mvpp2_bm_bufs_add()
net: mvpp2: drop useless fields in mvpp2_bm_pool and related code
net: mvpp2: remove unused 'tx_skb' field of 'struct mvpp2_tx_queue'
net: mvpp2: release reference to txq_cpu[] entry after unmapping
net: mvpp2: handle too large value in mvpp2_rx_time_coal_set()
net: mvpp2: handle too large value handling in mvpp2_rx_pkts_coal_set()
net: mvpp2: remove useless arguments in mvpp2_rx_{pkts, time}_coal_set
...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABAgAGBQJYrElpAAoJELDendYovxMvNFQH/RJU7lwDSf7rF7ZzFGvdcsfi
T4DDuowkYoJm2+GypoRVzZZ0lxJlxr0mNKPvgGDvuTogMY7pvjAf6B7/xCvTFsNU
UoO2I7ljgXxCXFRiXH50nAjS7PC2PFW3Qx+8XPIWeZmnUPeJi4Q43fiSloUt+a6l
JgS/autOCflGasR5MihCZXkvdVF81K6GuEd3hCh9GKZ/8RiwNPaY50vHnMv/hfqq
SJNRKOTSRXioYlTohLnjuPWHDMayRJEO48IXl3c7aNxDTkHjn78yaoPhJ7m+M0Bq
s2GyaQA4tCwADhP+tilmI5H1vFpt6w9x7O0dgWiSm7TB91lwfZOen2WhTPPp6L8=
=gktV
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.11-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from Juergen Gross:
"Xen features and fixes:
- a series from Boris Ostrovsky adding support for booting Linux as
Xen PVH guest
- a series from Juergen Gross streamlining the xenbus driver
- a series from Paul Durrant adding support for the new device model
hypercall
- several small corrections"
* tag 'for-linus-4.11-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/privcmd: add IOCTL_PRIVCMD_RESTRICT
xen/privcmd: Add IOCTL_PRIVCMD_DM_OP
xen/privcmd: return -ENOTTY for unimplemented IOCTLs
xen: optimize xenbus driver for multiple concurrent xenstore accesses
xen: modify xenstore watch event interface
xen: clean up xenbus internal headers
xenbus: Neaten xenbus_va_dev_error
xen/pvh: Use Xen's emergency_restart op for PVH guests
xen/pvh: Enable CPU hotplug
xen/pvh: PVH guests always have PV devices
xen/pvh: Initialize grant table for PVH guests
xen/pvh: Make sure we don't use ACPI_IRQ_MODEL_PIC for SCI
xen/pvh: Bootstrap PVH guest
xen/pvh: Import PVH-related Xen public interfaces
xen/x86: Remove PVH support
x86/boot/32: Convert the 32-bit pgtable setup code from assembly to C
xen/manage: correct return value check on xenbus_scanf()
x86/xen: Fix APIC id mismatch warning on Intel
xen/netback: set default upper limit of tx/rx queues to 8
xen/netfront: set default upper limit of tx/rx queues to 8
rx_refill_timer should be deleted as soon as we disconnect from the
backend since otherwise it is possible for the timer to go off before
we get to xennet_destroy_queues(). If this happens we may dereference
queue->rx.sring which is set to NULL in xennet_disconnect_backend().
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
CC: stable@vger.kernel.org
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes a crash when running out of grant refs when creating many
queues across many netdevs.
* If creating queues fails (i.e. there are no grant refs available),
call xenbus_dev_fatal() to ensure that the xenbus device is set to the
closed state.
* If no queues are created, don't call xennet_disconnect_backend as
netdev->real_num_tx_queues will not have been set correctly.
* If setup_netfront() fails, ensure that all the queues created are
cleaned up, not just those that have been set up.
* If any queues were set up and an error occurs, call
xennet_destroy_queues() to clean up the napi context.
* If any fatal error occurs, unregister and destroy the netdev to avoid
leaving around a half setup network device.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit 90c311b0ee ("xen-netfront: Fix Rx stall during network
stress and OOM") caused the refill timer to be triggerred almost on
all invocations of xennet_alloc_rx_buffers for certain workloads.
This reworks the fix by reverting to the old behaviour and taking into
consideration the skb allocation failure. Refill timer is now triggered
on insufficient requests or skb allocation failure.
Signed-off-by: Vineeth Remanan Pillai <vineethp@amazon.com>
Fixes: 90c311b0ee (xen-netfront: Fix Rx stall during network stress and OOM)
Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
napi_complete_done() allows to opt-in for gro_flush_timeout,
added back in linux-3.19, commit 3b47d30396
("net: gro: add a per device gro flush timer")
This allows for more efficient GRO aggregation without
sacrifying latencies.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The default for the number of tx/rx queues of one interface is the
number of vcpus of the system today. As each queue pair reserves 512
grant pages this default consumes a ridiculous number of grants for
large guests.
Limit the queue number to 8 as default. This value can be modified
via a module parameter if required.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
During an OOM scenario, request slots could not be created as skb
allocation fails. So the netback cannot pass in packets and netfront
wrongly assumes that there is no more work to be done and it disables
polling. This causes Rx to stall.
The issue is with the retry logic which schedules the timer if the
created slots are less than NET_RX_SLOTS_MIN. The count of new request
slots to be pushed are calculated as a difference between new req_prod
and rsp_cons which could be more than the actual slots, if there are
unconsumed responses.
The fix is to calculate the count of newly created slots as the
difference between new req_prod and old req_prod.
Signed-off-by: Vineeth Remanan Pillai <vineethp@amazon.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The network device operation for reading statistics is only called
in one place, and it ignores the return value. Having a structure
return value is potentially confusing because some future driver could
incorrectly assume that the return value was used.
Fix all drivers with ndo_get_stats64 to have a void function.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABAgAGBQJYT5HMAAoJELDendYovxMvhNQH/1g/3ahM4JKN8Z0SbjKBEdQm
yj2xOj6cE3l6wMSUblKjZD2DLLhpmcHT/E97Xro/lZQEfQJoMXXWWDFowMU/P1LA
mJxb7Fzq5Wr+6eGSAlIQB270MrpNi/luf+CWHMwVA3V7R3KRXwonOdGQSkISIzCd
tgIydEA3a9r2+HgeIBpZFZ4GcSrJQU75krMyl2tjD1C+jeYVd+zdoj2OnDsZQDZQ
hDWApMpNbpSBAn7JtSSdXWSTBsGH0lUECebeYPhPQ2sX2P6Y8+UCGwA7i6FFdbTa
agXfVSdRz8dCe3k19VcKDAw6nK9BTTMnEeEHmkmygIh6wuHPP44CzigTXIbJoXI=
=zjfm
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.10-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from Juergen Gross:
"Xen features and fixes for 4.10
These are some fixes, a move of some arm related headers to share them
between arm and arm64 and a series introducing a helper to make code
more readable.
The most notable change is David stepping down as maintainer of the
Xen hypervisor interface. This results in me sending you the pull
requests for Xen related code from now on"
* tag 'for-linus-4.10-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (29 commits)
xen/balloon: Only mark a page as managed when it is released
xenbus: fix deadlock on writes to /proc/xen/xenbus
xen/scsifront: don't request a slot on the ring until request is ready
xen/x86: Increase xen_e820_map to E820_X_MAX possible entries
x86: Make E820_X_MAX unconditionally larger than E820MAX
xen/pci: Bubble up error and fix description.
xen: xenbus: set error code on failure
xen: set error code on failures
arm/xen: Use alloc_percpu rather than __alloc_percpu
arm/arm64: xen: Move shared architecture headers to include/xen/arm
xen/events: use xen_vcpu_id mapping for EVTCHNOP_status
xen/gntdev: Use VM_MIXEDMAP instead of VM_IO to avoid NUMA balancing
xen-scsifront: Add a missing call to kfree
MAINTAINERS: update XEN HYPERVISOR INTERFACE
xenfs: Use proc_create_mount_point() to create /proc/xen
xen-platform: use builtin_pci_driver
xen-netback: fix error handling output
xen: make use of xenbus_read_unsigned() in xenbus
xen: make use of xenbus_read_unsigned() in xen-pciback
xen: make use of xenbus_read_unsigned() in xen-fbfront
...
Use xenbus_read_unsigned() instead of xenbus_scanf() when possible.
This requires to change the type of some reads from int to unsigned,
but these cases have been wrong before: negative values are not allowed
for the modified cases.
Cc: netdev@vger.kernel.org
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: David Vrabel <david.vrabel@citrix.com>
IS_ERR_VALUE() in commit 87557efc27
("xen-netfront: do not cast grant table reference to signed short") would
not return true for error code unless we cast ref first to type int.
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While grant reference is of type uint32_t, xen-netfront erroneously casts
it to signed short in BUG_ON().
This would lead to the xen domU panic during boot-up or migration when it
is attached with lots of paravirtual devices.
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
hyperv_net:
- set min/max_mtu, per Haiyang, after rndis_filter_device_add
virtio_net:
- set min/max_mtu
- remove virtnet_change_mtu
vmxnet3:
- set min/max_mtu
xen-netback:
- min_mtu = 0, max_mtu = 65517
xen-netfront:
- min_mtu = 0, max_mtu = 65535
unisys/visor:
- clean up defines a little to not clash with network core or add
redundat definitions
CC: netdev@vger.kernel.org
CC: virtualization@lists.linux-foundation.org
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: "Michael S. Tsirkin" <mst@redhat.com>
CC: Shrikrishna Khare <skhare@vmware.com>
CC: "VMware, Inc." <pv-drivers@vmware.com>
CC: Wei Liu <wei.liu2@citrix.com>
CC: Paul Durrant <paul.durrant@citrix.com>
CC: David Kershner <david.kershner@unisys.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Small packet loss is reported on complex multi host network configurations
including tunnels, NAT, ... My investigation led me to the following check
in netback which drops packets:
if (unlikely(txreq.size < ETH_HLEN)) {
netdev_err(queue->vif->dev,
"Bad packet size: %d\n", txreq.size);
xenvif_tx_err(queue, &txreq, extra_count, idx);
break;
}
But this check itself is legitimate. SKBs consist of a linear part (which
has to have the ethernet header) and (optionally) a number of frags.
Netfront transmits the head of the linear part up to the page boundary
as the first request and all the rest becomes frags so when we're
reconstructing the SKB in netback we can't distinguish between original
frags and the 'tail' of the linear part. The first SKB needs to be at
least ETH_HLEN size. So in case we have an SKB with its linear part
starting too close to the page boundary the packet is lost.
I see two ways to fix the issue:
- Change the 'wire' protocol between netfront and netback to start keeping
the original SKB structure. We'll have to add a flag indicating the fact
that the particular request is a part of the original linear part and not
a frag. We'll need to know the length of the linear part to pre-allocate
memory.
- Avoid transmitting SKBs with linear parts starting too close to the page
boundary. That seems preferable short-term and shouldn't bring
significant performance degradation as such packets are rare. That's what
this patch is trying to achieve with skb_copy().
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Acked-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Trying to batch Tx response events results in poor performance because
this delays freeing the transmitted skbs.
Instead use the standard RING_FINAL_CHECK_FOR_RESPONSES() macro to be
notified once the next Tx response is placed on the ring.
Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Improve balloon driver memory hotplug placement.
- Use unpopulated hotplugged memory for foreign pages (if
supported/enabled).
- Support 64 KiB guest pages on arm64.
- CPU hotplug support on arm/arm64.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJWOeSkAAoJEFxbo/MsZsTRph0H/0nE8Tx0GyGtOyCYfBdInTvI
WgjvL8VR1XrweZMVis3668MzhLSYg6b5lvJsoi+L3jlzYRyze43iHXsKfvp+8p0o
TVUhFnlHEHF8ASEtPydAi6HgS7Dn9OQ9LaZ45R1Gk0rHnwJjIQonhTn2jB0yS9Am
Hf4aZXP2NVZphjYcloqNsLH0G6mGLtgq8cS0uKcVO2YIrR4Dr3sfj9qfq9mflf8n
sA/5ifoHRfOUD1vJzYs4YmIBUv270jSsprWK/Mi2oXIxUTBpKRAV1RVCAPW6GFci
HIZjIJkjEPWLsvxWEs0dUFJQGp3jel5h8vFPkDWBYs3+9rILU2DnLWpKGNDHx3k=
=vUfa
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.4-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from David Vrabel:
- Improve balloon driver memory hotplug placement.
- Use unpopulated hotplugged memory for foreign pages (if
supported/enabled).
- Support 64 KiB guest pages on arm64.
- CPU hotplug support on arm/arm64.
* tag 'for-linus-4.4-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (44 commits)
xen: fix the check of e_pfn in xen_find_pfn_range
x86/xen: add reschedule point when mapping foreign GFNs
xen/arm: don't try to re-register vcpu_info on cpu_hotplug.
xen, cpu_hotplug: call device_offline instead of cpu_down
xen/arm: Enable cpu_hotplug.c
xenbus: Support multiple grants ring with 64KB
xen/grant-table: Add an helper to iterate over a specific number of grants
xen/xenbus: Rename *RING_PAGE* to *RING_GRANT*
xen/arm: correct comment in enlighten.c
xen/gntdev: use types from linux/types.h in userspace headers
xen/gntalloc: use types from linux/types.h in userspace headers
xen/balloon: Use the correct sizeof when declaring frame_list
xen/swiotlb: Add support for 64KB page granularity
xen/swiotlb: Pass addresses rather than frame numbers to xen_arch_need_swiotlb
arm/xen: Add support for 64KB page granularity
xen/privcmd: Add support for Linux 64KB page granularity
net/xen-netback: Make it running on 64KB page granularity
net/xen-netfront: Make it running on 64KB page granularity
block/xen-blkback: Make it running on 64KB page granularity
block/xen-blkfront: Make it running on 64KB page granularity
...
Conflicts:
net/ipv6/xfrm6_output.c
net/openvswitch/flow_netlink.c
net/openvswitch/vport-gre.c
net/openvswitch/vport-vxlan.c
net/openvswitch/vport.c
net/openvswitch/vport.h
The openvswitch conflicts were overlapping changes. One was
the egress tunnel info fix in 'net' and the other was the
vport ->send() op simplification in 'net-next'.
The xfrm6_output.c conflicts was also a simplification
overlapping a bug fix.
Signed-off-by: David S. Miller <davem@davemloft.net>
The PV network protocol is using 4KB page granularity. The goal of this
patch is to allow a Linux using 64KB page granularity using network
device on a non-modified Xen.
It's only necessary to adapt the ring size and break skb data in small
chunk of 4KB. The rest of the code is relying on the grant table code.
Note that we allocate a Linux page for each rx skb but only the first
4KB is used. We may improve the memory usage by extending the size of
the rx skb.
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Sometimes xennet_create_queues() may failed to created all requested
queues, we need to update num_queues to real created to avoid NULL
pointer dereference.
Signed-off-by: Joe Jin <joe.jin@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: David S. Miller <davem@davemloft.net>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If netfront connects with two (or more) queues and then reconnects with
only one queue it fails to delete or rewrite the multi-queue-num-queues
key and netback will try to use the wrong number of queues.
Always write the num-queues field if the backend has multi-queue support.
Signed-off-by: Chas Williams <3chas3@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Use the correct GFN/BFN terms more consistently.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJV8VRMAAoJEFxbo/MsZsTRiGQH/i/jrAJUJfrFC2PINaA2gDwe
O0dlrkCiSgAYChGmxxxXZQSPM5Po5+EbT/dLjZ/uvSooeorM9RYY/mFo7ut/qLep
4pyQUuwGtebWGBZTrj9sygUVXVhgJnyoZxskNUbhj9zvP7hb9++IiI78mzne6cpj
lCh/7Z2dgpfRcKlNRu+qpzP79Uc7OqIfDK+IZLrQKlXa7IQDJTQYoRjbKpfCtmMV
BEG3kN9ESx5tLzYiAfxvaxVXl9WQFEoktqe9V8IgOQlVRLgJ2DQWS6vmraGrokWM
3HDOCHtRCXlPhu1Vnrp0R9OgqWbz8FJnmVAndXT8r3Nsjjmd0aLwhJx7YAReO/4=
=JDia
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.3-rc0b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen terminology fixes from David Vrabel:
"Use the correct GFN/BFN terms more consistently"
* tag 'for-linus-4.3-rc0b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/xenbus: Rename the variable xen_store_mfn to xen_store_gfn
xen/privcmd: Further s/MFN/GFN/ clean-up
hvc/xen: Further s/MFN/GFN clean-up
video/xen-fbfront: Further s/MFN/GFN clean-up
xen/tmem: Use xen_page_to_gfn rather than pfn_to_gfn
xen: Use correctly the Xen memory terminologies
arm/xen: implement correctly pfn_to_mfn
xen: Make clear that swiotlb and biomerge are dealing with DMA address
Originally that parameter was always reset to num_online_cpus during
module initialisation, which renders it useless.
The fix is to only set max_queues to num_online_cpus when user has not
provided a value.
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Tested-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based on include/xen/mm.h [1], Linux is mistakenly using MFN when GFN
is meant, I suspect this is because the first support for Xen was for
PV. This resulted in some misimplementation of helpers on ARM and
confused developers about the expected behavior.
For instance, with pfn_to_mfn, we expect to get an MFN based on the name.
Although, if we look at the implementation on x86, it's returning a GFN.
For clarity and avoid new confusion, replace any reference to mfn with
gfn in any helpers used by PV drivers. The x86 code will still keep some
reference of pfn_to_mfn which may be used by all kind of guests
No changes as been made in the hypercall field, even
though they may be invalid, in order to keep the same as the defintion
in xen repo.
Note that page_to_mfn has been renamed to xen_page_to_gfn to avoid a
name to close to the KVM function gfn_to_page.
Take also the opportunity to simplify simple construction such
as pfn_to_mfn(page_to_pfn(page)) into xen_page_to_gfn. More complex clean up
will come in follow-up patches.
[1] http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=e758ed14f390342513405dd766e874934573e6cb
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
If you simply load and unload the module without starting the interfaces,
the queues are never created and you get a bad pointer dereference.
Signed-off-by: Chas Williams <3chas3@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) mlx4 driver bug fixes (TX queue wakeups, csum complete indications)
from Ido Shamay, Eran Ben Elisha, and Or Gerlitz.
2) Missing unlock in error path of PTP support in renesas driver, from
Dan Carpenter.
3) Add Vitesse 8641 phy IDs to vitesse PHY driver, from Shaohui Xie.
4) Bnx2x driver bug fixes (linearization of encap packets, scratchpad
parity error notifications, flow-control and speed settings) from
Yuval Mintz, Manish Chopra, Shahed Shaikh, and Ariel Elior.
5) ipv6 extension header parsing in the igb chip has a HW errata,
disable it. Frm Todd Fujinaka.
6) Fix PCI link state locking issue in e1000e driver, from Yanir
Lubetkin.
7) Cure panics during MTU change in i40e, from Mitch Williams.
8) Don't leak promisc refs in DSA slave driver, from Gilad Ben-Yossef.
9) Add missing HAS_DMA dep to VIA Rhine driver, from Geery
Uytterhoeven.
10) Make sure DMA map/unmap calls are symmetric in bnx2x driver, from
Michal Schmidt.
11) Workaround for MDIO access problems in bcm7xxx devices, from FLorian
Fainelli.
12) Fix races in SCTP protocol between OTTB responses and route
removals, from Alexander Sverdlin.
13) Fix jumbo frame checksum issue with some mvneta devices, from Simon
Guinot.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (58 commits)
sock_diag: don't broadcast kernel sockets
net: mvneta: disable IP checksum with jumbo frames for Armada 370
ARM: mvebu: update Ethernet compatible string for Armada XP
net: mvneta: introduce compatible string "marvell, armada-xp-neta"
api: fix compatibility of linux/in.h with netinet/in.h
net: icplus: fix typo in constant name
sis900: Trivial: Fix typos in enums
stmmac: Trivial: fix typo in constant name
sctp: Fix race between OOTB responce and route removal
net-Liquidio: Delete unnecessary checks before the function call "vfree"
vmxnet3: Bump up driver version number
amd-xgbe: Add the __GFP_NOWARN flag to Rx buffer allocation
net: phy: mdio-bcm-unimac: workaround initial read failures for integrated PHYs
net: bcmgenet: workaround initial read failures for integrated PHYs
net: phy: bcm7xxx: workaround MDIO management controller initial read
bnx2x: fix DMA API usage
net: via: VIA_RHINE and VIA_VELOCITY should depend on HAS_DMA
net/phy: tune get_phy_c45_ids to support more c45 phy
bnx2x: fix lockdep splat
net: fec: don't access RACC register when not available
...
- Add "make xenconfig" to assist in generating configs for Xen guests.
- Preparatory cleanups necessary for supporting 64 KiB pages in ARM
guests.
- Automatically use hvc0 as the default console in ARM guests.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJVkpoqAAoJEFxbo/MsZsTRu3IH/2AMPx2i65hoSqfHtGf3sz/z
XNfcidVmOElFVXGaW83m0tBWMemT5LpOGRfiq5sIo8xt/8xD2vozEkl/3kkf3RrX
EmZDw3E8vmstBdBTjWdovVhNenRc0m0pB5daS7wUdo9cETq1ag1L3BHTB3fEBApO
74V6qAfnhnq+snqWhRD3XAk3LKI0nWuWaV+5HsmxDtnunGhuRLGVs7mwxZGg56sM
mILA0eApGPdwyVVpuDe0SwV52V8E/iuVOWTcomGEN2+cRWffG5+QpHxQA8bOtF6O
KfqldiNXOY/idM+5+oSm9hespmdWbyzsFqmTYz0LvQvxE8eEZtHHB3gIcHkE8QU=
=danz
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.2-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from David Vrabel:
"Xen features and cleanups for 4.2-rc0:
- add "make xenconfig" to assist in generating configs for Xen guests
- preparatory cleanups necessary for supporting 64 KiB pages in ARM
guests
- automatically use hvc0 as the default console in ARM guests"
* tag 'for-linus-4.2-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
block/xen-blkback: s/nr_pages/nr_segs/
block/xen-blkfront: Remove invalid comment
block/xen-blkfront: Remove unused macro MAXIMUM_OUTSTANDING_BLOCK_REQS
arm/xen: Drop duplicate define mfn_to_virt
xen/grant-table: Remove unused macro SPP
xen/xenbus: client: Fix call of virt_to_mfn in xenbus_grant_ring
xen: Include xen/page.h rather than asm/xen/page.h
kconfig: add xenconfig defconfig helper
kconfig: clarify kvmconfig is for kvm
xen/pcifront: Remove usage of struct timeval
xen/tmem: use BUILD_BUG_ON() in favor of BUG_ON()
hvc_xen: avoid uninitialized variable warning
xenbus: avoid uninitialized variable warning
xen/arm: allow console=hvc0 to be omitted for guests
arm,arm64/xen: move Xen initialization earlier
arm/xen: Correctly check if the event channel interrupt is present
The function netif_set_real_num_tx_queues() will return -EINVAL if
the second parameter < 1, so call this function with the second
parameter set to 0 is meaningless.
Signed-off-by: Liang Li <liang.z.li@intel.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rx->status is an int16_t, print it using %d rather than %u in order to
have a meaningful value when the field is negative.
Also use %u rather than %x for rx->offset.
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Using xen/page.h will be necessary later for using common xen page
helpers.
As xen/page.h already include asm/xen/page.h, always use the later.
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: netdev@vger.kernel.org
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Conflicts:
drivers/net/phy/amd-xgbe-phy.c
drivers/net/wireless/iwlwifi/Kconfig
include/net/mac80211.h
iwlwifi/Kconfig and mac80211.h were both trivial overlapping
changes.
The drivers/net/phy/amd-xgbe-phy.c file got removed in 'net-next' and
the bug fix that happened on the 'net' side is already integrated
into the rest of the amd-xgbe driver.
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the timer API function setup_timer instead of structure field
assignments to initialize a timer.
A simplified version of the Coccinelle semantic patch that performs
this transformation is as follows:
@change@
expression e, func, da;
@@
-init_timer (&e);
+setup_timer (&e, func, da);
-e.data = da;
-e.function = func;
Signed-off-by: Vaishali Thakkar <vthakkar1994@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
xennet_remove() freed the queues before freeing the netdevice which
results in a use-after-free when free_netdev() tries to delete the
napi instances that have already been freed.
Fix this by fully destroy the queues (which includes deleting the napi
instances) before freeing the netdevice.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix verifier memory corruption and other bugs in BPF layer, from
Alexei Starovoitov.
2) Add a conservative fix for doing BPF properly in the BPF classifier
of the packet scheduler on ingress. Also from Alexei.
3) The SKB scrubber should not clear out the packet MARK and security
label, from Herbert Xu.
4) Fix oops on rmmod in stmmac driver, from Bryan O'Donoghue.
5) Pause handling is not correct in the stmmac driver because it
doesn't take into consideration the RX and TX fifo sizes. From
Vince Bridgers.
6) Failure path missing unlock in FOU driver, from Wang Cong.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits)
net: dsa: use DEVICE_ATTR_RW to declare temp1_max
netns: remove BUG_ONs from net_generic()
IB/ipoib: Fix ndo_get_iflink
sfc: Fix memcpy() with const destination compiler warning.
altera tse: Fix network-delays and -retransmissions after high throughput.
net: remove unused 'dev' argument from netif_needs_gso()
act_mirred: Fix bogus header when redirecting from VLAN
inet_diag: fix access to tcp cc information
tcp: tcp_get_info() should fetch socket fields once
net: dsa: mv88e6xxx: Add missing initialization in mv88e6xxx_set_port_state()
skbuff: Do not scrub skb mark within the same name space
Revert "net: Reset secmark when scrubbing packet"
bpf: fix two bugs in verification logic when accessing 'ctx' pointer
bpf: fix bpf helpers to use skb->mac_header relative offsets
stmmac: Configure Flow Control to work correctly based on rxfifo size
stmmac: Enable unicast pause frame detect in GMAC Register 6
stmmac: Read tx-fifo-depth and rx-fifo-depth from the devicetree
stmmac: Add defines and documentation for enabling flow control
stmmac: Add properties for transmit and receive fifo sizes
stmmac: fix oops on rmmod after assigning ip addr
...
In commit 04ffcb255f ("net: Add ndo_gso_check") Tom originally
added the 'dev' argument to be able to call ndo_gso_check().
Then later, when generalizing this in commit 5f35227ea3
("net: Generalize ndo_gso_check to ndo_features_check")
Jesse removed the call to ndo_gso_check() in netif_needs_gso()
by calling the new ndo_features_check() in a different place.
This made the 'dev' argument unused.
Remove the unused argument and go back to the code as before.
Cc: Tom Herbert <therbert@google.com>
Cc: Jesse Gross <jesse@nicira.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Use a single source list of hypercalls, generating other tables
etc. at build time.
- Add a "Xen PV" APIC driver to support >255 VCPUs in PV guests.
- Significant performance improve to guest save/restore/migration.
- scsiback/front save/restore support.
- Infrastructure for multi-page xenbus rings.
- Misc fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJVL6OZAAoJEFxbo/MsZsTRs3YH/2AycBHs129DNnc2OLmOklBz
AdD43k+FOfZlv0YU80WmPVmOpGGHGB5Pqkix2KtnvPYmtx3pb/5ikhDwSTWZpqBl
Qq6/RgsRjYZ8VMKqrMTkJMrJWHQYbg8lgsP5810nsFBn/Qdbxms+WBqpMkFVo3b2
rvUZj8QijMJPS3qr55DklVaOlXV4+sTAytTdCiubVnaB/agM2jjRflp/lnJrhtTg
yc4NTrIlD1RsMV/lNh92upBP/pCm6Bs0zQ2H1v3hkdhBBmaO0IVXpSheYhfDOHfo
9v209n137N7X86CGWImFk6m2b+EfiFnLFir07zKSA+iZwkYKn75znSdPfj0KCc0=
=bxTm
-----END PGP SIGNATURE-----
Merge tag 'stable/for-linus-4.1-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen features and fixes from David Vrabel:
- use a single source list of hypercalls, generating other tables etc.
at build time.
- add a "Xen PV" APIC driver to support >255 VCPUs in PV guests.
- significant performance improve to guest save/restore/migration.
- scsiback/front save/restore support.
- infrastructure for multi-page xenbus rings.
- misc fixes.
* tag 'stable/for-linus-4.1-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/pci: Try harder to get PXM information for Xen
xenbus_client: Extend interface to support multi-page ring
xen-pciback: also support disabling of bus-mastering and memory-write-invalidate
xen: support suspend/resume in pvscsi frontend
xen: scsiback: add LUN of restored domain
xen-scsiback: define a pr_fmt macro with xen-pvscsi
xen/mce: fix up xen_late_init_mcelog() error handling
xen/privcmd: improve performance of MMAPBATCH_V2
xen: unify foreign GFN map/unmap for auto-xlated physmap guests
x86/xen/apic: WARN with details.
x86/xen: Provide a "Xen PV" APIC driver to support >255 VCPUs
xen/pciback: Don't print scary messages when unsupported by hypervisor.
xen: use generated hypercall symbols in arch/x86/xen/xen-head.S
xen: use generated hypervisor symbols in arch/x86/xen/trace.c
xen: synchronize include/xen/interface/xen.h with xen
xen: build infrastructure for generating hypercall depending symbols
xen: balloon: Use static attribute groups for sysfs entries
xen: pcpu: Use static attribute groups for sysfs entry
Originally Xen PV drivers only use single-page ring to pass along
information. This might limit the throughput between frontend and
backend.
The patch extends Xenbus driver to support multi-page ring, which in
general should improve throughput if ring is the bottleneck. Changes to
various frontend / backend to adapt to the new interface are also
included.
Affected Xen drivers:
* blkfront/back
* netfront/back
* pcifront/back
* scsifront/back
* vtpmfront
The interface is documented, as before, in xenbus_client.c.
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Cc: Konrad Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
xen-netfront limits transmitted skbs to be at most 44 segments in size. However,
GSO permits up to 65536 bytes, which means a maximum of 45 segments of 1448
bytes each. This slight reduction in the size of packets means a slight loss in
efficiency.
Since c/s 9ecd1a75d, xen-netfront sets gso_max_size to
XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER,
where XEN_NETIF_MAX_TX_SIZE is 65535 bytes.
The calculation used by tcp_tso_autosize (and also tcp_xmit_size_goal since c/s
6c09fa09d) in determining when to split an skb into two is
sk->sk_gso_max_size - 1 - MAX_TCP_HEADER.
So the maximum permitted size of an skb is calculated to be
(XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER) - 1 - MAX_TCP_HEADER.
Intuitively, this looks like the wrong formula -- we don't need two TCP headers.
Instead, there is no need to deviate from the default gso_max_size of 65536 as
this already accommodates the size of the header.
Currently, the largest skb transmitted by netfront is 63712 bytes (44 segments
of 1448 bytes each), as observed via tcpdump. This patch makes netfront send
skbs of up to 65160 bytes (45 segments of 1448 bytes each).
Similarly, the maximum allowable mtu does not need to subtract MAX_TCP_HEADER as
it relates to the size of the whole packet, including the header.
Fixes: 9ecd1a75d9 ("xen-netfront: reduce gso_max_size to account for max TCP header")
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of manual calls of device_create_file() and
device_remove_files(), assign the static attribute groups to netdev
groups array. This simplifies the code and avoids the possible
races.
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Acked-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/xen-netfront.c
Minor overlapping changes in xen-netfront.c, mostly to do
with some buffer management changes alongside the split
of stats into TX and RX.
Signed-off-by: David S. Miller <davem@davemloft.net>
Eliminate all the duplicate code for making Tx requests by
consolidating them into a single xennet_make_one_txreq() function.
xennet_make_one_txreq() and xennet_make_txreqs() work with pages and
offsets so it will be easier to make netfront handle highmem frags in
the future.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A function to count the number of slots an skb needs is more useful
than one that counts the slots needed for only the frags.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In netfront the Rx and Tx path are independent and use different
locks. The Tx lock is held with hard irqs disabled, but Rx lock is
held with only BH disabled. Since both sides use the same stats lock,
a deadlock may occur.
[ INFO: possible irq lock inversion dependency detected ]
3.16.2 #16 Not tainted
---------------------------------------------------------
swapper/0/0 just changed the state of lock:
(&(&queue->tx_lock)->rlock){-.....}, at: [<c03adec8>]
xennet_tx_interrupt+0x14/0x34
but this lock took another, HARDIRQ-unsafe lock in the past:
(&stat->syncp.seq#2){+.-...}
and interrupts could create inverse lock ordering between them.
other info that might help us debug this:
Possible interrupt unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&stat->syncp.seq#2);
local_irq_disable();
lock(&(&queue->tx_lock)->rlock);
lock(&stat->syncp.seq#2);
<Interrupt>
lock(&(&queue->tx_lock)->rlock);
Using separate locks for the Rx and Tx stats fixes this deadlock.
Reported-by: Dmitry Piotrovsky <piotrovskydmitry@gmail.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes some unused arrays from the netfront private
data structures. These arrays were used in "flip" receive mode.
Signed-off-by: Vincenzo Maffione <v.maffione@gmail.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After d75b1ade56 (net: less interrupt
masking in NAPI) the napi instance is removed from the per-cpu list
prior to calling the n->poll(), and is only requeued if all of the
budget was used. This inadvertently broke netfront because netfront
does not use NAPI correctly.
If netfront had not used all of its budget it would do a final check
for any Rx responses and avoid calling napi_complete() if there were
more responses. It would still return under budget so it would never
be rescheduled. The final check would also not re-enable the Rx
interrupt.
Additionally, xenvif_poll() would also call napi_complete() /after/
enabling the interrupt. This resulted in a race between the
napi_complete() and the napi_schedule() in the interrupt handler. The
use of local_irq_save/restore() avoided by race iff the handler is
running on the same CPU but not if it was running on a different CPU.
Fix both of these by always calling napi_compete() if the budget was
not all used, and then calling napi_schedule() if the final checks
says there's more work.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/amd/xgbe/xgbe-desc.c
drivers/net/ethernet/renesas/sh_eth.c
Overlapping changes in both conflict cases.
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 97a6d1bb2b (xen-netfront: Fix
handling packets on compound pages with skb_linearize) attempted to
fix a problem where an skb that would have required too many slots
would be dropped causing TCP connections to stall.
However, it filled in the first slot using the original buffer and not
the new one and would use the wrong offset and grant access to the
wrong page.
Netback would notice the malformed request and stop all traffic on the
VIF, reporting:
vif vif-3-0 vif3.0: txreq.offset: 85e, size: 4002, end: 6144
vif vif-3-0 vif3.0: fatal error; disabling device
Reported-by: Anthony Wright <anthony@overnetdata.com>
Tested-by: Anthony Wright <anthony@overnetdata.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These BUGs can be erroneously triggered by frags which refer to
tail pages within a compound page. The data in these pages may
overrun the hardware page while still being contained within the
compound page, but since compound_order() evaluates to 0 for tail
pages the assertion fails. The code already iterates through
subsequent pages correctly in this scenario, so the BUGs are
unnecessary and can be removed.
Fixes: f36c374782 ("xen/netfront: handle compound page fragments on transmit")
Cc: <stable@vger.kernel.org> # 3.7+
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A full Rx ring only requires 1 MiB of memory. This is not enough
memory that it is useful to dynamically scale the number of Rx
requests in the ring based on traffic rates, because:
a) Even the full 1 MiB is a tiny fraction of a typically modern Linux
VM (for example, the AWS micro instance still has 1 GiB of memory).
b) Netfront would have used up to 1 MiB already even with moderate
data rates (there was no adjustment of target based on memory
pressure).
c) Small VMs are going to typically have one VCPU and hence only one
queue.
Keeping the ring full of Rx requests handles bursty traffic better
than trying to converge on an optimal number of requests to keep
filled.
On a 4 core host, an iperf -P 64 -t 60 run from dom0 to a 4 VCPU guest
improved from 5.1 Gbit/s to 5.6 Gbit/s. Gains with more bursty
traffic are expected to be higher.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Include fixes for netrom and dsa (Fabian Frederick and Florian
Fainelli)
2) Fix FIXED_PHY support in stmmac, from Giuseppe CAVALLARO.
3) Several SKB use after free fixes (vxlan, openvswitch, vxlan,
ip_tunnel, fou), from Li ROngQing.
4) fec driver PTP support fixes from Luwei Zhou and Nimrod Andy.
5) Use after free in virtio_net, from Michael S Tsirkin.
6) Fix flow mask handling for megaflows in openvswitch, from Pravin B
Shelar.
7) ISDN gigaset and capi bug fixes from Tilman Schmidt.
8) Fix route leak in ip_send_unicast_reply(), from Vasily Averin.
9) Fix two eBPF JIT bugs on x86, from Alexei Starovoitov.
10) TCP_SKB_CB() reorganization caused a few regressions, fixed by Cong
Wang and Eric Dumazet.
11) Don't overwrite end of SKB when parsing malformed sctp ASCONF
chunks, from Daniel Borkmann.
12) Don't call sock_kfree_s() with NULL pointers, this function also has
the side effect of adjusting the socket memory usage. From Cong Wang.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (90 commits)
bna: fix skb->truesize underestimation
net: dsa: add includes for ethtool and phy_fixed definitions
openvswitch: Set flow-key members.
netrom: use linux/uaccess.h
dsa: Fix conversion from host device to mii bus
tipc: fix bug in bundled buffer reception
ipv6: introduce tcp_v6_iif()
sfc: add support for skb->xmit_more
r8152: return -EBUSY for runtime suspend
ipv4: fix a potential use after free in fou.c
ipv4: fix a potential use after free in ip_tunnel_core.c
hyperv: Add handling of IP header with option field in netvsc_set_hash()
openvswitch: Create right mask with disabled megaflows
vxlan: fix a free after use
openvswitch: fix a use after free
ipv4: dst_entry leak in ip_send_unicast_reply()
ipv4: clean up cookie_v4_check()
ipv4: share tcp_v4_save_options() with cookie_v4_check()
ipv4: call __ip_options_echo() in cookie_v4_check()
atm: simplify lanai.c by using module_pci_driver
...
Add ndo_gso_check which a device can define to indicate whether is
is capable of doing GSO on a packet. This funciton would be called from
the stack to determine whether software GSO is needed to be done. A
driver should populate this function if it advertises GSO types for
which there are combinations that it wouldn't be able to handle. For
instance a device that performs UDP tunneling might only implement
support for transparent Ethernet bridging type of inner packets
or might have limitations on lengths of inner headers.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The DEFINE_XENBUS_DRIVER() macro looks a bit weird and causes sparse
errors.
Replace the uses with standard structure definitions instead. This is
similar to pci and usb device registration.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
There is a long known problem with the netfront/netback interface: if the guest
tries to send a packet which constitues more than MAX_SKB_FRAGS + 1 ring slots,
it gets dropped. The reason is that netback maps these slots to a frag in the
frags array, which is limited by size. Having so many slots can occur since
compound pages were introduced, as the ring protocol slice them up into
individual (non-compound) page aligned slots. The theoretical worst case
scenario looks like this (note, skbs are limited to 64 Kb here):
linear buffer: at most PAGE_SIZE - 17 * 2 bytes, overlapping page boundary,
using 2 slots
first 15 frags: 1 + PAGE_SIZE + 1 bytes long, first and last bytes are at the
end and the beginning of a page, therefore they use 3 * 15 = 45 slots
last 2 frags: 1 + 1 bytes, overlapping page boundary, 2 * 2 = 4 slots
Although I don't think this 51 slots skb can really happen, we need a solution
which can deal with every scenario. In real life there is only a few slots
overdue, but usually it causes the TCP stream to be blocked, as the retry will
most likely have the same buffer layout.
This patch solves this problem by linearizing the packet. This is not the
fastest way, and it can fail much easier as it tries to allocate a big linear
area for the whole packet, but probably easier by an order of magnitude than
anything else. Probably this code path is not touched very frequently anyway.
Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Paul Durrant <paul.durrant@citrix.com>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: xen-devel@lists.xenproject.org
Signed-off-by: David S. Miller <davem@davemloft.net>
When less than the requested number of queues could be created, include
the actual number in the warning (instead of the requested number).
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since netfront may reconnect to a backend with a different number of
queues, all per-queue Rx and Tx resources (skbs and grant references)
should be freed when disconnecting.
Without this fix, the Tx and Rx grant refs are not released and
netfront will exhaust them after only a few reconnections. netfront
will fail to connect when no free grant references are available.
Since all Rx bufs are freed and reallocated instead of reused this
will add some additional delay to the reconnection but this is
expected to be small compared to the time taken by any backend hotplug
scripts etc.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If no queues could be created when connecting to the backend, one of the
error paths would deadlock.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In xennet_disconnect_backend(), netif_carrier_off() was called once
per queue when it needs to only be called once.
The queue locking around the netif_carrier_off() call looked very
odd. I think they were supposed to synchronize any NAPI instances with
the expectation that no further NAPI instances would be scheduled
because of the carrier being off (see the check in
xennet_rx_interrupt()). But I can't easily tell if this works
correctly.
Instead, add a napi_synchronize() call after disabling the interrupts.
This is obviously correct as with no Rx interrupts, no further NAPI
instances will be scheduled.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The nesting of the per-queue rx_lock and tx_lock in xennet_connect()
is confusing to both humans and lockdep. The locking is safe because
this is the only place where the locks are nested in this way but
lockdep still warns.
Instead of adding the missing lockdep annotations, refactor the
locking to avoid the confusing nesting. This is still safe, because
the xenbus connection state changes are all serialized by the xenwatch
thread.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: David S. Miller <davem@davemloft.net>
When reconnecting to the backend (after a resume/migration, for example),
a different number of queues may be required (since the guest may have
moved to a different host with different capabilities). During the
reconnection the old queues are torn down and new ones created.
Introduce xennet_create_queues() and xennet_destroy_queues() that fixes
three bugs during the reconnection.
- The old info->queues was leaked.
- The old queue's napi instances were not deleted.
- The new queue's napi instances were left disabled (which meant no
packets could be received).
The xennet_destroy_queues() calls is deferred until the reconnection
instead of the disconnection (in xennet_disconnect_backend()) because
napi_disable() might sleep.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
xennet_disconnect_backend() was not correctly iterating over all the
queues.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Build on the refactoring of the previous patch to implement multiple
queues between xen-netfront and xen-netback.
Check XenStore for multi-queue support, and set up the rings and event
channels accordingly.
Write ring references and event channels to XenStore in a queue
hierarchy if appropriate, or flat when using only one queue.
Update the xennet_select_queue() function to choose the queue on which
to transmit a packet based on the skb hash result.
Signed-off-by: Andrew J. Bennieston <andrew.bennieston@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for multi-queue support in xen-netfront, move the
queue-specific data from struct netfront_info to struct netfront_queue,
and update the rest of the code to use this.
Also adds loops over queues where appropriate, even though only one is
configured at this point, and uses alloc_etherdev_mq() and the
corresponding multi-queue netif wake/start/stop functions in preparation
for multiple active queues.
Finally, implements a trivial queue selection function suitable for
ndo_select_queue, which simply returns 0, selecting the first (and
only) queue.
Signed-off-by: Andrew J. Bennieston <andrew.bennieston@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net: get rid of SET_ETHTOOL_OPS
Dave Miller mentioned he'd like to see SET_ETHTOOL_OPS gone.
This does that.
Mostly done via coccinelle script:
@@
struct ethtool_ops *ops;
struct net_device *dev;
@@
- SET_ETHTOOL_OPS(dev, ops);
+ dev->ethtool_ops = ops;
Compile tested only, but I'd seriously wonder if this broke anything.
Suggested-by: Dave Miller <davem@davemloft.net>
Signed-off-by: Wilfried Klaebe <w-lkml@lebenslange-mailadresse.de>
Acked-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the initialization of an array used in the TX
datapath that was mistakenly initialized together with the
RX datapath arrays. An out of range array access could happen
when RX and TX rings had different sizes.
Signed-off-by: Vincenzo Maffione <v.maffione@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace dev_kfree_skb with dev_kfree_skb_any in xennet_start_xmit
which can be called in hard irq and other contexts. xennet_start_xmit
only fress skbs which it drops.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Replace the bh safe variant with the hard irq safe variant.
We need a hard irq safe variant to deal with netpoll transmitting
packets from hard irq context, and we need it in most if not all of
the places using the bh safe variant.
Except on 32bit uni-processor the code is exactly the same so don't
bother with a bh variant, just have a hard irq safe variant that
everyone can use.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/wireless/ath/ath9k/recv.c
drivers/net/wireless/mwifiex/pcie.c
net/ipv6/sit.c
The SIT driver conflict consists of a bug fix being done by hand
in 'net' (missing u64_stats_init()) whilst in 'net-next' a helper
was created (netdev_alloc_pcpu_stats()) which takes care of this.
The two wireless conflicts were overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
In ed1f50c3a ("net: add skb_checksum_setup") we introduced some checksum
functions in core driver. Subsequent change b5cf66cd1 ("xen-netfront:
use new skb_checksum_setup function") made use of those functions to
replace its own implementation.
However with that change netfront is broken. It sees a lot of checksum
error. That's because its own implementation of checksum function was a
bit hacky (dereferencing skb->data directly) while the new function was
implemented using ip_hdr(). The network header is not reset before skb
is passed to the new function. When the new function tries to do its
job, it's confused and reports error.
The fix is simple, we need to reset network header before passing skb to
checksum function. Netback is not affected as it already does the right
thing.
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Paul Durrant <paul.durrant@citrix.com>
Tested-By: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are many drivers calling alloc_percpu() to allocate pcpu stats
and then initializing ->syncp. So just introduce a helper function for them.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Backend drivers shouldn't transistion to CLOSED unless the frontend is
CLOSED. If a backend does transition to CLOSED too soon then the
frontend may not see the CLOSING state and will not properly shutdown.
So, treat an unexpected backend CLOSED state the same as CLOSING.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes grant transfer releasing code from netfront, and uses
gnttab_end_foreign_access to end grant access since
gnttab_end_foreign_access_ref may fail when the grant entry is
currently used for reading or writing.
* clean up grant transfer code kept from old netfront(2.6.18) which grants
pages for access/map and transfer. But grant transfer is deprecated in current
netfront, so remove corresponding release code for transfer.
* fix resource leak, release grant access (through gnttab_end_foreign_access)
and skb for tx/rx path, use get_page to ensure page is released when grant
access is completed successfully.
Xen-blkfront/xen-tpmfront/xen-pcifront also have similar issue, but patches
for them will be created separately.
V6: Correct subject line and commit message.
V5: Remove unecessary change in xennet_end_access.
V4: Revert put_page in gnttab_end_foreign_access, and keep netfront change in
single patch.
V3: Changes as suggestion from David Vrabel, ensure pages are not freed untill
grant acess is ended.
V2: Improve patch comments.
Signed-off-by: Annie Li <annie.li@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
1) BPF debugger and asm tool by Daniel Borkmann.
2) Speed up create/bind in AF_PACKET, also from Daniel Borkmann.
3) Correct reciprocal_divide and update users, from Hannes Frederic
Sowa and Daniel Borkmann.
4) Currently we only have a "set" operation for the hw timestamp socket
ioctl, add a "get" operation to match. From Ben Hutchings.
5) Add better trace events for debugging driver datapath problems, also
from Ben Hutchings.
6) Implement auto corking in TCP, from Eric Dumazet. Basically, if we
have a small send and a previous packet is already in the qdisc or
device queue, defer until TX completion or we get more data.
7) Allow userspace to manage ipv6 temporary addresses, from Jiri Pirko.
8) Add a qdisc bypass option for AF_PACKET sockets, from Daniel
Borkmann.
9) Share IP header compression code between Bluetooth and IEEE802154
layers, from Jukka Rissanen.
10) Fix ipv6 router reachability probing, from Jiri Benc.
11) Allow packets to be captured on macvtap devices, from Vlad Yasevich.
12) Support tunneling in GRO layer, from Jerry Chu.
13) Allow bonding to be configured fully using netlink, from Scott
Feldman.
14) Allow AF_PACKET users to obtain the VLAN TPID, just like they can
already get the TCI. From Atzm Watanabe.
15) New "Heavy Hitter" qdisc, from Terry Lam.
16) Significantly improve the IPSEC support in pktgen, from Fan Du.
17) Allow ipv4 tunnels to cache routes, just like sockets. From Tom
Herbert.
18) Add Proportional Integral Enhanced packet scheduler, from Vijay
Subramanian.
19) Allow openvswitch to mmap'd netlink, from Thomas Graf.
20) Key TCP metrics blobs also by source address, not just destination
address. From Christoph Paasch.
21) Support 10G in generic phylib. From Andy Fleming.
22) Try to short-circuit GRO flow compares using device provided RX
hash, if provided. From Tom Herbert.
The wireless and netfilter folks have been busy little bees too.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2064 commits)
net/cxgb4: Fix referencing freed adapter
ipv6: reallocate addrconf router for ipv6 address when lo device up
fib_frontend: fix possible NULL pointer dereference
rtnetlink: remove IFLA_BOND_SLAVE definition
rtnetlink: remove check for fill_slave_info in rtnl_have_link_slave_info
qlcnic: update version to 5.3.55
qlcnic: Enhance logic to calculate msix vectors.
qlcnic: Refactor interrupt coalescing code for all adapters.
qlcnic: Update poll controller code path
qlcnic: Interrupt code cleanup
qlcnic: Enhance Tx timeout debugging.
qlcnic: Use bool for rx_mac_learn.
bonding: fix u64 division
rtnetlink: add missing IFLA_BOND_AD_INFO_UNSPEC
sfc: Use the correct maximum TX DMA ring size for SFC9100
Add Shradha Shah as the sfc driver maintainer.
net/vxlan: Share RX skb de-marking and checksum checks with ovs
tulip: cleanup by using ARRAY_SIZE()
ip_tunnel: clear IPCB in ip_tunnel_xmit() in case dst_link_failure() is called
net/cxgb4: Don't retrieve stats during recovery
...
This patch adds support for IPv6 checksum offload and GSO when those
features are available in the backend.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use skb_checksum_setup to set up partial checksum offsets rather
then a private implementation.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The user has the option of disabling the platform driver:
00:02.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
which is used to unplug the emulated drivers (IDE, Realtek 8169, etc)
and allow the PV drivers to take over. If the user wishes
to disable that they can set:
xen_platform_pci=0
(in the guest config file)
or
xen_emul_unplug=never
(on the Linux command line)
except it does not work properly. The PV drivers still try to
load and since the Xen platform driver is not run - and it
has not initialized the grant tables, most of the PV drivers
stumble upon:
input: Xen Virtual Keyboard as /devices/virtual/input/input5
input: Xen Virtual Pointer as /devices/virtual/input/input6M
------------[ cut here ]------------
kernel BUG at /home/konrad/ssd/konrad/linux/drivers/xen/grant-table.c:1206!
invalid opcode: 0000 [#1] SMP
Modules linked in: xen_kbdfront(+) xenfs xen_privcmd
CPU: 6 PID: 1389 Comm: modprobe Not tainted 3.13.0-rc1upstream-00021-ga6c892b-dirty #1
Hardware name: Xen HVM domU, BIOS 4.4-unstable 11/26/2013
RIP: 0010:[<ffffffff813ddc40>] [<ffffffff813ddc40>] get_free_entries+0x2e0/0x300
Call Trace:
[<ffffffff8150d9a3>] ? evdev_connect+0x1e3/0x240
[<ffffffff813ddd0e>] gnttab_grant_foreign_access+0x2e/0x70
[<ffffffffa0010081>] xenkbd_connect_backend+0x41/0x290 [xen_kbdfront]
[<ffffffffa0010a12>] xenkbd_probe+0x2f2/0x324 [xen_kbdfront]
[<ffffffff813e5757>] xenbus_dev_probe+0x77/0x130
[<ffffffff813e7217>] xenbus_frontend_dev_probe+0x47/0x50
[<ffffffff8145e9a9>] driver_probe_device+0x89/0x230
[<ffffffff8145ebeb>] __driver_attach+0x9b/0xa0
[<ffffffff8145eb50>] ? driver_probe_device+0x230/0x230
[<ffffffff8145eb50>] ? driver_probe_device+0x230/0x230
[<ffffffff8145cf1c>] bus_for_each_dev+0x8c/0xb0
[<ffffffff8145e7d9>] driver_attach+0x19/0x20
[<ffffffff8145e260>] bus_add_driver+0x1a0/0x220
[<ffffffff8145f1ff>] driver_register+0x5f/0xf0
[<ffffffff813e55c5>] xenbus_register_driver_common+0x15/0x20
[<ffffffff813e76b3>] xenbus_register_frontend+0x23/0x40
[<ffffffffa0015000>] ? 0xffffffffa0014fff
[<ffffffffa001502b>] xenkbd_init+0x2b/0x1000 [xen_kbdfront]
[<ffffffff81002049>] do_one_initcall+0x49/0x170
.. snip..
which is hardly nice. This patch fixes this by having each
PV driver check for:
- if running in PV, then it is fine to execute (as that is their
native environment).
- if running in HVM, check if user wanted 'xen_emul_unplug=never',
in which case bail out and don't load any PV drivers.
- if running in HVM, and if PCI device 5853:0001 (xen_platform_pci)
does not exist, then bail out and not load PV drivers.
- (v2) if running in HVM, and if the user wanted 'xen_emul_unplug=ide-disks',
then bail out for all PV devices _except_ the block one.
Ditto for the network one ('nics').
- (v2) if running in HVM, and if the user wanted 'xen_emul_unplug=unnecessary'
then load block PV driver, and also setup the legacy IDE paths.
In (v3) make it actually load PV drivers.
Reported-by: Sander Eikelenboom <linux@eikelenboom.it
Reported-by: Anthony PERARD <anthony.perard@citrix.com>
Reported-and-Tested-by: Fabio Fantoni <fabio.fantoni@m2r.biz>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[v2: Add extra logic to handle the myrid ways 'xen_emul_unplug'
can be used per Ian and Stefano suggestion]
[v3: Make the unnecessary case work properly]
[v4: s/disks/ide-disks/ spotted by Fabio]
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com> [for PCI parts]
CC: stable@vger.kernel.org
Pull networking fixes from David Miller:
"Mostly these are fixes for fallout due to merge window changes, as
well as cures for problems that have been with us for a much longer
period of time"
1) Johannes Berg noticed two major deficiencies in our genetlink
registration. Some genetlink protocols we passing in constant
counts for their ops array rather than something like
ARRAY_SIZE(ops) or similar. Also, some genetlink protocols were
using fixed IDs for their multicast groups.
We have to retain these fixed IDs to keep existing userland tools
working, but reserve them so that other multicast groups used by
other protocols can not possibly conflict.
In dealing with these two problems, we actually now use less state
management for genetlink operations and multicast groups.
2) When configuring interface hardware timestamping, fix several
drivers that simply do not validate that the hwtstamp_config value
is one the driver actually supports. From Ben Hutchings.
3) Invalid memory references in mwifiex driver, from Amitkumar Karwar.
4) In dev_forward_skb(), set the skb->protocol in the right order
relative to skb_scrub_packet(). From Alexei Starovoitov.
5) Bridge erroneously fails to use the proper wrapper functions to make
calls to netdev_ops->ndo_vlan_rx_{add,kill}_vid. Fix from Toshiaki
Makita.
6) When detaching a bridge port, make sure to flush all VLAN IDs to
prevent them from leaking, also from Toshiaki Makita.
7) Put in a compromise for TCP Small Queues so that deep queued devices
that delay TX reclaim non-trivially don't have such a performance
decrease. One particularly problematic area is 802.11 AMPDU in
wireless. From Eric Dumazet.
8) Fix crashes in tcp_fastopen_cache_get(), we can see NULL socket dsts
here. Fix from Eric Dumzaet, reported by Dave Jones.
9) Fix use after free in ipv6 SIT driver, from Willem de Bruijn.
10) When computing mergeable buffer sizes, virtio-net fails to take the
virtio-net header into account. From Michael Dalton.
11) Fix seqlock deadlock in ip4_datagram_connect() wrt. statistic
bumping, this one has been with us for a while. From Eric Dumazet.
12) Fix NULL deref in the new TIPC fragmentation handling, from Erik
Hugne.
13) 6lowpan bit used for traffic classification was wrong, from Jukka
Rissanen.
14) macvlan has the same issue as normal vlans did wrt. propagating LRO
disabling down to the real device, fix it the same way. From Michal
Kubecek.
15) CPSW driver needs to soft reset all slaves during suspend, from
Daniel Mack.
16) Fix small frame pacing in FQ packet scheduler, from Eric Dumazet.
17) The xen-netfront RX buffer refill timer isn't properly scheduled on
partial RX allocation success, from Ma JieYue.
18) When ipv6 ping protocol support was added, the AF_INET6 protocol
initialization cleanup path on failure was borked a little. Fix
from Vlad Yasevich.
19) If a socket disconnects during a read/recvmsg/recvfrom/etc that
blocks we can do the wrong thing with the msg_name we write back to
userspace. From Hannes Frederic Sowa. There is another fix in the
works from Hannes which will prevent future problems of this nature.
20) Fix route leak in VTI tunnel transmit, from Fan Du.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits)
genetlink: make multicast groups const, prevent abuse
genetlink: pass family to functions using groups
genetlink: add and use genl_set_err()
genetlink: remove family pointer from genl_multicast_group
genetlink: remove genl_unregister_mc_group()
hsr: don't call genl_unregister_mc_group()
quota/genetlink: use proper genetlink multicast APIs
drop_monitor/genetlink: use proper genetlink multicast APIs
genetlink: only pass array to genl_register_family_with_ops()
tcp: don't update snd_nxt, when a socket is switched from repair mode
atm: idt77252: fix dev refcnt leak
xfrm: Release dst if this dst is improper for vti tunnel
netlink: fix documentation typo in netlink_set_err()
be2net: Delete secondary unicast MAC addresses during be_close
be2net: Fix unconditional enabling of Rx interface options
net, virtio_net: replace the magic value
ping: prevent NULL pointer dereference on write to msg_name
bnx2x: Prevent "timeout waiting for state X"
bnx2x: prevent CFC attention
bnx2x: Prevent panic during DMAE timeout
...
There was a bug in xennet_alloc_rx_buffers, when allocating page or
sk_buff failed, and at the same time rx_batch queue not empty,
the rx_refill_timer timer won't be scheduled. If finally the remaining
request buffers in rx ring less than what backend driver expected,
the backend driver would think of rx ring as full and start dropping packets.
In such situation, there is no way for the netfront driver to recover
automatically, so that the device can not work properly.
The patch fixes the problem by always scheduling rx_refill_timer timer when
alloc_page or __netdev_alloc_skb fails, no matter whether rx_batch queue is
empty or not. It ensures that the rx ring request buffers will finally meet
the backend needs.
Signed-off-by: Ma JieYue <jieyue.majy@alibaba-inc.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull core locking changes from Ingo Molnar:
"The biggest changes:
- add lockdep support for seqcount/seqlocks structures, this
unearthed both bugs and required extra annotation.
- move the various kernel locking primitives to the new
kernel/locking/ directory"
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
block: Use u64_stats_init() to initialize seqcounts
locking/lockdep: Mark __lockdep_count_forward_deps() as static
lockdep/proc: Fix lock-time avg computation
locking/doc: Update references to kernel/mutex.c
ipv6: Fix possible ipv6 seqlock deadlock
cpuset: Fix potential deadlock w/ set_mems_allowed
seqcount: Add lockdep functionality to seqcount/seqlock structures
net: Explicitly initialize u64_stats_sync structures for lockdep
locking: Move the percpu-rwsem code to kernel/locking/
locking: Move the lglocks code to kernel/locking/
locking: Move the rwsem code to kernel/locking/
locking: Move the rtmutex code to kernel/locking/
locking: Move the semaphore core to kernel/locking/
locking: Move the spinlock code to kernel/locking/
locking: Move the lockdep code to kernel/locking/
locking: Move the mutex code to kernel/locking/
hung_task debugging: Add tracepoint to report the hang
x86/locking/kconfig: Update paravirt spinlock Kconfig description
lockstat: Report avg wait and hold times
lockdep, x86/alternatives: Drop ancient lockdep fixup message
...
In order to enable lockdep on seqcount/seqlock structures, we
must explicitly initialize any locks.
The u64_stats_sync structure, uses a seqcount, and thus we need
to introduce a u64_stats_init() function and use it to initialize
the structure.
This unfortunately adds a lot of fairly trivial initialization code
to a number of drivers. But the benefit of ensuring correctness makes
this worth while.
Because these changes are required for lockdep to be enabled, and the
changes are quite trivial, I've not yet split this patch out into 30-some
separate patches, as I figured it would be better to get the various
maintainers thoughts on how to best merge this change along with
the seqcount lockdep enablement.
Feedback would be appreciated!
Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: James Morris <jmorris@namei.org>
Cc: Jesse Gross <jesse@nicira.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mirko Lindner <mlindner@marvell.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Roger Luethi <rl@hellgate.ch>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Simon Horman <horms@verge.net.au>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Wensong Zhang <wensong@linux-vs.org>
Cc: netdev@vger.kernel.org
Link: http://lkml.kernel.org/r/1381186321-4906-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Anirban was seeing netfront received MTU size packets, which downgraded
throughput. The following patch makes netfront use GRO API which
improves throughput for that case.
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Anirban Chakraborty <abchak@juniper.net>
Cc: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Konrad Wilk <konrad.wilk@oracle.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Due to commit 3683243b ("xen-netfront: use __pskb_pull_tail to ensure
linear area is big enough on RX") xennet_fill_frags() may end up
filling MAX_SKB_FRAGS + 1 fragments in a receive skb, and only reduce
the fragment count subsequently via __pskb_pull_tail(). That's a
result of xennet_get_responses() allowing a maximum of one more slot to
be consumed (and intermediately transformed into a fragment) if the
head slot has a size less than or equal to RX_COPY_THRESHOLD.
Hence we need to adjust xennet_fill_frags() to pull earlier if we
reached the maximum fragment count - due to the described behavior of
xennet_get_responses() this guarantees that at least the first fragment
will get completely consumed, and hence the fragment count reduced.
In order to not needlessly call __pskb_pull_tail() twice, make the
original call conditional upon the pull target not having been reached
yet, and defer the newly added one as much as possible (an alternative
would have been to always call the function right before the call to
xennet_fill_frags(), but that would imply more frequent cases of
needing to call it twice).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: stable@vger.kernel.org (3.6 onwards)
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of mixing printk and pr_<level> forms,
just use pr_<level>
Miscellaneous changes around these conversions:
Add a missing newline to avoid message interleaving,
coalesce formats, reflow modified lines to 80 columns.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
use skb_partial_csum_set() to simplify the codes
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a new feature called feature-split-event-channels for
netfront, enabling it to handle TX and RX events separately.
If netback does not support this feature, it falls back to use single event
channel.
If netfront fails to setup split event channels, it will try falling back to
single event channel.
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should correctly free related resources (grant ref, memory page, evtchn)
when setup_netfront fails.
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The maximum packet including header that can be handled by netfront / netback
wire format is 65535. Reduce gso_max_size accordingly.
Drop skb and print warning when skb->len > 65535. This can 1) save the effort
to send malformed packet to netback, 2) help spotting misconfiguration of
netfront in the future.
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also fix a typo in comment.
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function is in fact counting the ring slots required for responses.
Separate the concepts of ring slots and skb frags make the code clearer, as
now netfront and netback can have different MAX_SKB_FRAGS, slot and frag are
not mapped 1:1 any more.
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This variable is supposed to hold reference to the last extra_info in the
loop. However there is only type of extra info here and the loop to process
extra info is missing, so this variable is never used and causes confusion.
Remove it at the moment. We can add it back when necessary.
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using RX_COPY_THRESHOLD is incorrect if the SKB is actually smaller
than that. We have already accounted for this in
NETFRONT_SKB_CB(skb)->pull_to so use that instead.
Fixes WARN_ON from skb_try_coalesce.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: annie li <annie.li@oracle.com>
Cc: xen-devel@lists.xen.org
Cc: netdev@vger.kernel.org
Cc: stable@kernel.org # 3.7.x only
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>