Граф коммитов

34018 Коммитов

Автор SHA1 Сообщение Дата
Ian Morris cc24becae3 ipv6: White-space cleansing : Structure layouts
This patch makes no changes to the logic of the code but simply addresses
coding style issues as detected by checkpatch.

Both objdump and diff -w show no differences.

This patch addresses structure definitions, specifically it cleanses the brace
placement and replaces spaces with tabs in a few places.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-24 22:37:52 -07:00
Ian Morris 67ba4152e8 ipv6: White-space cleansing : Line Layouts
This patch makes no changes to the logic of the code but simply addresses
coding style issues as detected by checkpatch.

Both objdump and diff -w show no differences.

A number of items are addressed in this patch:
* Multiple spaces converted to tabs
* Spaces before tabs removed.
* Spaces in pointer typing cleansed (char *)foo etc.
* Remove space after sizeof
* Ensure spacing around comparators such as if statements.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-24 22:37:52 -07:00
Tom Herbert 48a5fc7731 gre: When GRE csum is present count as encap layer wrt csum
In GRE demux if the GRE checksum pop rcv encapsulation so that any
encapsulated checksums are treated as tunnel checksums.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-24 18:09:24 -07:00
Tom Herbert 57c67ff4bd udp: additional GRO support
Implement GRO for UDPv6. Add UDP checksum verification in gro_receive
for both UDP4 and UDP6 calling skb_gro_checksum_validate_zero_check.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-24 18:09:24 -07:00
Tom Herbert 149d0774a7 tcp: Call skb_gro_checksum_validate
In tcp[64]_gro_receive call skb_gro_checksum_validate to validate TCP
checksum in the gro context.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-24 18:09:24 -07:00
Tom Herbert 758f75d1ff gre: call skb_gro_checksum_simple_validate
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-24 18:09:23 -07:00
Tom Herbert 573e8fca25 net: skb_gro_checksum_* functions
Add skb_gro_checksum_validate, skb_gro_checksum_validate_zero_check,
and skb_gro_checksum_simple_validate, and __skb_gro_checksum_complete.
These are the cognates of the normal checksum functions but are used
in the gro_receive path and operate on GRO related fields in sk_buffs.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-24 18:09:23 -07:00
Daniel Borkmann 8fc54f6891 net: use reciprocal_scale() helper
Replace open codings of (((u64) <x> * <y>) >> 32) with reciprocal_scale().

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-23 12:21:21 -07:00
David S. Miller 690e36e726 net: Allow raw buffers to be passed into the flow dissector.
Drivers, and perhaps other entities we have not yet considered,
sometimes want to know how deep the protocol headers go before
deciding how large of an SKB to allocate and how much of the packet to
place into the linear SKB area.

For example, consider a driver which has a device which DMAs into
pools of pages and then tells the driver where the data went in the
DMA descriptor(s).  The driver can then build an SKB and reference
most of the data via SKB fragments (which are page/offset/length
triplets).

However at least some of the front of the packet should be placed into
the linear SKB area, which comes before the fragments, so that packet
processing can get at the headers efficiently.  The first thing each
protocol layer is going to do is a "pskb_may_pull()" so we might as
well aggregate as much of this as possible while we're building the
SKB in the driver.

Part of supporting this is that we don't have an SKB yet, so we want
to be able to let the flow dissector operate on a raw buffer in order
to compute the offset of the end of the headers.

So now we have a __skb_flow_dissect() which takes an explicit data
pointer and length.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-23 12:13:41 -07:00
Jon Paul Maloy 301bae56f2 tipc: merge struct tipc_port into struct tipc_sock
We complete the merging of the port and socket layer by aggregating
the fields of struct tipc_port directly into struct tipc_sock, and
moving the combined structure into socket.c.

We also move all functions and macros that are not any longer
exposed to the rest of the stack into socket.c, and rename them
accordingly.

Despite the size of this commit, there are no functional changes.
We have only made such changes that are necessary due of the removal
of struct tipc_port.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-23 11:18:35 -07:00
Jon Paul Maloy 808d90f9c5 tipc: remove files ref.h and ref.c
The reference table is now 'socket aware' instead of being generic,
and has in reality become a socket internal table. In order to be
able to minimize the API exposed by the socket layer towards the rest
of the stack, we now move the reference table definitions and functions
into the file socket.c, and rename the functions accordingly.

There are no functional changes in this commit.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-23 11:18:35 -07:00
Jon Paul Maloy 2e84c60b77 tipc: remove include file port.h
We move the inline functions in the file port.h to socket.c, and modify
their names accordingly.

We move struct tipc_port and some macros to socket.h.

Finally, we remove the file port.h.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-23 11:18:35 -07:00
Jon Paul Maloy 0fc87aaebd tipc: remove source file port.c
In this commit, we move the remaining functions in port.c to
socket.c, and give them new names that correspond to their new
location. We then remove the file port.c.

There are only cosmetic changes to the moved functions.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-23 11:18:35 -07:00
Jon Paul Maloy 6c9808ce09 tipc: remove port_lock
In previous commits we have reduced usage of port_lock to a minimum,
and complemented it with usage of bh_lock_sock() at the remaining
locations. The purpose has been to remove this lock altogether, since
it largely duplicates the role of bh_lock_sock. We are now ready to do
this.

However, we still need to protect the BH callers from inadvertent
release of the socket while they hold a reference to it. We do this by
replacing port_lock by a combination of a rw-lock protecting the
reference table as such, and updating the socket reference counter while
the socket is referenced from BH. This technique is more standard and
comprehensible than the previous approach, and turns out to have a
positive effect on overall performance.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-23 11:18:34 -07:00
Jon Paul Maloy 9b50fd087a tipc: replace port pointer with socket pointer in registry
In order to make tipc_sock the only entity referencable from other
parts of the stack, we add a tipc_sock pointer instead of a tipc_port
pointer to the registry. As a consequence, we also let the function
tipc_port_lock() return a pointer to a tipc_sock instead  of a tipc_port.
We keep the function's name for now, since the lock still is owned by
the port.

This is another step in the direction of eliminating port_lock, replacing
its usage with lock_sock() and bh_lock_sock().

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-23 11:18:34 -07:00
Jon Paul Maloy 5a9ee0be33 tipc: use registry when scanning sockets
The functions tipc_port_get_ports() and tipc_port_reinit() scan over
all sockets/ports to access each of them. This is done by using a
dedicated linked list, 'tipc_socks' where all sockets are members. The
list is in turn protected by a spinlock, 'port_list_lock', while each
socket is locked by using port_lock at the moment of access.

In order to reduce complexity and risk of deadlock, we want to get
rid of the linked list and the accompanying spinlock.

This is what we do in this commit. Instead of the linked list, we use
the port registry to scan across the sockets. We also add usage of
bh_lock_sock() inside the scope of port_lock in both functions, as a
preparation for the complete removal of port_lock.

Finally, we move the functions from port.c to socket.c, and rename them
to tipc_sk_sock_show() and tipc_sk_reinit() repectively.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-23 11:18:34 -07:00
Jon Paul Maloy 5b8fa7ce82 tipc: eliminate functions tipc_port_init and tipc_port_destroy
After the latest changes to the socket/port layer the existence of
the functions tipc_port_init() and tipc_port_destroy() cannot be
justified. They are both called only once, from tipc_sk_create() and
tipc_sk_delete() respectively, and their functionality can better be
merged into the latter two functions.

This also entails that all remaining references to port_lock now are
made from inside socket.c, something that will make it easier to remove
this lock.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-23 11:18:34 -07:00
Jon Paul Maloy 739f5e4efc tipc: redefine message acknowledge function
The function tipc_acknowledge() is a remnant from the obsolete native
API. Currently, it grabs port_lock, before building an acknowledge
message and sending it to the peer.

Since all access to socket members now is protected by the socket lock,
it has become unnecessary to grab port_lock here.

In this commit, we remove the usage of port_lock, simplify the
function, and move it to socket.c, renaming it to tipc_sk_send_ack().

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-23 11:18:34 -07:00
Jon Paul Maloy dadebc0029 tipc: eliminate port_connect()/port_disconnect() functions
tipc_port_connect()/tipc_port_disconnect() are remnants of the obsolete
native API. Their only task is to grab port_lock and call the functions
__tipc_port_connect()/__tipc_port_disconnect() respectively, which will
perform the actual state change.

Since socket/port exection now is single-threaded the use of port_lock
is not needed any more, so we can safely replace the two functions with
their lock-free counterparts.

In this commit, we remove the two functions. Furthermore, the contents
of __tipc_port_disconnect() is so trivial that we choose to eliminate
that function too, expanding its functionality into tipc_shutdown().
__tipc_port_connect() is simplified, moved to socket.c, and given the
more correct name tipc_sk_finish_conn(). Finally, we eliminate the
function auto_connect(), and expand its contents into filter_connect().

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-23 11:18:34 -07:00
Jon Paul Maloy 80e44c2225 tipc: eliminate function tipc_port_shutdown()
tipc_port_shutdown() is a remnant from the now obsolete native
interface. As such it grabs port_lock in order to protect itself
from concurrent BH processing.

However, after the recent changes to the port/socket upcalls, sockets
are now basically single-threaded, and all execution, except the read-only
tipc_sk_timer(), is executing within the protection of lock_sock(). So
the use of port_lock is not needed here.

In this commit we eliminate the whole function, and merge it into its
only caller, tipc_shutdown().

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-23 11:18:34 -07:00
Jon Paul Maloy 5728901581 tipc: clean up socket timer function
The last remaining BH upcall to the socket, apart for the message
reception function tipc_sk_rcv(), is the timer function.

We prefer to let this function continue executing in BH, since it only
does read-acces to semi-permanent data, but we make three changes to it:

1) We introduce a bh_lock_sock()/bh_unlock_sock() inside the scope
   of port_lock.  This is a preparation for replacing port_lock with
   bh_lock_sock() at the locations where it is still used.

2) We move the function from port.c to socket.c, as a further step
   of eliminating the port code level altogether.

3) We let it make use of the newly introduced tipc_msg_create()
   function. This enables us to get rid of three context specific
   functions (port_create_self_abort_msg() etc.) in port.c

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-23 11:18:33 -07:00
Jon Paul Maloy 02be61a981 tipc: use message to abort connections when losing contact to node
In the current implementation, each 'struct tipc_node' instance keeps
a linked list of those ports/sockets that are connected to the node
represented by that struct. The purpose of this is to let the node
object know which sockets to alert when it loses contact with its peer
node, i.e., which sockets need to have their connections aborted.

This entails an unwanted direct reference from the node structure
back to the port/socket structure, and a need to grab port_lock
when we have to make an upcall to the port. We want to get rid of
this unecessary BH entry point into the socket, and also eliminate
its use of port_lock.

In this commit, we instead let the node struct keep list of "connected
socket" structs, which each represents a connected socket, but is
allocated independently by the node at the moment of connection. If
the node loses contact with its peer node, the list is traversed, and
a "connection abort" message is created for each entry in the list. The
message is sent to it respective connected socket using the ordinary
data path, and the receiving socket aborts its connections upon reception
of the message.

This enables us to get rid of the direct reference from 'struct node' to
´struct port', and another unwanted BH access point to the latter.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-23 11:18:33 -07:00
Jon Paul Maloy 50100a5e39 tipc: use pseudo message to wake up sockets after link congestion
The current link implementation keeps a linked list of blocked ports/
sockets that is populated when there is link congestion. The purpose
of this is to let the link know which users to wake up when the
congestion abates.

This adds unnecessary complexity to the data structure and the code,
since it forces us to involve the link each time we want to delete
a socket. It also forces us to grab the spinlock port_lock within
the scope of node_lock. We want to get rid of this direct dependence,
as well as the deadlock hazard resulting from the usage of port_lock.

In this commit, we instead let the link keep list of a "wakeup" pseudo
messages for use in such situations. Those messages are sent to the
pending sockets via the ordinary message reception path, and wake up
the socket's owner when they are received.

This enables us to get rid of the 'waiting_ports' linked lists in struct
tipc_port that manifest this direct reference. As a consequence, we can
eliminate another BH entry into the socket, and hence the need to grab
port_lock. This is a further step in our effort to remove port_lock
altogether.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-23 11:18:33 -07:00
Jon Paul Maloy 1dd0bd2b14 tipc: introduce new function tipc_msg_create()
The function tipc_msg_init() has turned out to be of limited value
in many cases. It take too few parameters to be usable for creating
a complete message, it makes too many assumptions about what the
message should be used for, and it does not allocate any buffer to
be returned to the caller.

Therefore, we now introduce the new function tipc_msg_create(), which
takes all the parameters needed to create a full message, and returns
a buffer of the requested size. The new function will be very useful
for the changes we will be doing in later commits in this series.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-23 11:18:33 -07:00
David S. Miller f9474ddfaa Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pulling to get some TIPC fixes that a net-next series depends
upon.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-23 11:12:08 -07:00
Yuchung Cheng 989e04c5bc tcp: improve undo on timeout
Upon timeout, undo (via both timestamps/Eifel and DSACKs) was
disabled if any retransmits were still in flight.  The concern was
perhaps that spurious retransmission sent in a previous recovery
episode may trigger DSACKs to falsely undo the current recovery.

However, this inadvertently misses undo opportunities (using either
TCP timestamps or DSACKs) when timeout occurs during a loss episode,
i.e.  recurring timeouts or timeout during fast recovery. In these
cases some retransmissions will be in flight but we should allow
undo. Furthermore, we should only reset undo_marker and undo_retrans
upon timeout if we are starting a new recovery episode. Finally,
when we do reset our undo state, we now do so in a manner similar
to tcp_enter_recovery(), so that we require a DSACK for each of
the outstsanding retransmissions. This will achieve the original
goal by requiring that we receive the same number of DSACKs as
retransmissions.

This patch increases the undo events by 50% on Google servers.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-22 21:28:02 -07:00
Eric Dumazet 884cf705c7 net: remove dead code after sk_data_ready change
As a followup to commit 676d23690f ("net: Fix use after free by
removing length arg from sk_data_ready callbacks"), we can remove
some useless code in sock_queue_rcv_skb() and rxrpc_queue_rcv_skb()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-22 21:08:50 -07:00
Eric Dumazet d2de875c6d net: use ktime_get_ns() and ktime_get_real_ns() helpers
ktime_get_ns() replaces ktime_to_ns(ktime_get())

ktime_get_real_ns() replaces ktime_to_ns(ktime_get_real())

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-22 19:57:23 -07:00
Michal Kazior 47e4df94d1 mac80211: fix channel switch for chanctx-based drivers
The new_ctx pointer is set only for non-chanctx drivers.  This yielded a
crash for chanctx-based drivers during channel switch finalization:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
  IP: ieee80211_vif_use_reserved_switch+0x71c/0xb00 [mac80211]

Use an adequate chanctx pointer to fix this.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-22 14:45:49 -07:00
Himangi Saraogi c0b802367b af_decnet: Use time_after_eq
The functions time_before, time_before_eq, time_after, and time_after_eq
are more robust for comparing jiffies against other values.

A simplified version of the Coccinelle semantic patch making this change
is as follows:

@change@
expression E1,E2,E3;
@@
- jiffies - E1 >= (E2*E3)
+ time_after_eq(jiffies, E1+E2*E3)

Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-22 12:23:11 -07:00
Himangi Saraogi 8b1b1eb521 decnet: Use time_after_eq
The functions time_before, time_before_eq, time_after, and time_after_eq
are more robust for comparing jiffies against other values.

A simplified version of the Coccinelle semantic patch making this change
is as follows:

@change@
expression E1,E2;
@@
- (jiffies - E1) >= E2
+ time_after_eq(jiffies, E1+E2)

Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-22 12:23:11 -07:00
Himangi Saraogi c72c95a064 ipconfig: Use time_before
The functions time_before, time_before_eq, time_after, and time_after_eq
are more robust for comparing jiffies against other values.

A simplified version of the Coccinelle semantic patch making this change
is as follows:

@change@
expression E1,E2;
@@
- jiffies - E1 < E2
+ time_before(jiffies, E1+E2)

Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-22 12:23:11 -07:00
Himangi Saraogi b5c5c36d36 dn_dev: Use time_before
The functions time_before, time_before_eq, time_after, and time_after_eq
are more robust for comparing jiffies against other values.

A simplified version of the Coccinelle semantic patch making this change
is as follows:

@change@
expression E1,E2;
@@

(
- (jiffies - E1) < E2
+ time_before(jiffies, E1+E2)
)

Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-22 12:23:11 -07:00
Andreea-Cristina Bernat 0932997e34 br_multicast: Replace rcu_assign_pointer() with RCU_INIT_POINTER()
The use of "rcu_assign_pointer()" is NULLing out the pointer.
According to RCU_INIT_POINTER()'s block comment:
"1.   This use of RCU_INIT_POINTER() is NULLing out the pointer"
it is better to use it instead of rcu_assign_pointer() because it has a
smaller overhead.

The following Coccinelle semantic patch was used:
@@
@@

- rcu_assign_pointer
+ RCU_INIT_POINTER
  (..., NULL)

Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-22 12:23:11 -07:00
Andreea-Cristina Bernat 8c6b00c816 net/openvswitch/flow.c: Replace rcu_dereference() with rcu_access_pointer()
The "rcu_dereference()" call is used directly in a condition.
Since its return value is never dereferenced it is recommended to use
"rcu_access_pointer()" instead of "rcu_dereference()".
Therefore, this patch makes the replacement.

The following Coccinelle semantic patch was used:
@@
@@

(
 if(
 (<+...
- rcu_dereference
+ rcu_access_pointer
  (...)
  ...+>)) {...}
|
 while(
 (<+...
- rcu_dereference
+ rcu_access_pointer
  (...)
  ...+>)) {...}
)

Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-22 12:23:10 -07:00
Andreea-Cristina Bernat e6b688838e net/ipv4/igmp.c: Replace rcu_dereference() with rcu_access_pointer()
The "rcu_dereference()" call is used directly in a condition.
Since its return value is never dereferenced it is recommended to use
"rcu_access_pointer()" instead of "rcu_dereference()".
Therefore, this patch makes the replacement.

The following Coccinelle semantic patch was used:
@@
@@

(
 if(
 (<+...
- rcu_dereference
+ rcu_access_pointer
  (...)
  ...+>)) {...}
|
 while(
 (<+...
- rcu_dereference
+ rcu_access_pointer
  (...)
  ...+>)) {...}
)

Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-22 12:23:10 -07:00
Sébastien Barré 1dced6a854 ipv4: Restore accept_local behaviour in fib_validate_source()
Commit 7a9bc9b81a ("ipv4: Elide fib_validate_source() completely when possible.")
introduced a short-circuit to avoid calling fib_validate_source when not
needed. That change took rp_filter into account, but not accept_local.
This resulted in a change of behaviour: with rp_filter and accept_local
off, incoming packets with a local address in the source field should be
dropped.

Here is how to reproduce the change pre/post 7a9bc9b81a commit:
-configure the same IPv4 address on hosts A and B.
-try to send an ARP request from B to A.
-The ARP request will be dropped before that commit, but accepted and answered
after that commit.

This adds a check for ACCEPT_LOCAL, to maintain full
fib validation in case it is 0. We also leave __fib_validate_source() earlier
when possible, based on the same check as fib_validate_source(), once the
accept_local stuff is verified.

Cc: Gregory Detal <gregory.detal@uclouvain.be>
Cc: Christoph Paasch <christoph.paasch@uclouvain.be>
Cc: Hannes Frederic Sowa <hannes@redhat.com>
Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: Sébastien Barré <sebastien.barre@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-22 12:23:10 -07:00
Daniel Borkmann aa4a83ee8b net: sctp: fix suboptimal edge-case on non-active active/retrans path selection
In SCTP, selection of active (T.ACT) and retransmission (T.RET)
transports is being done whenever transport control operations
(UP, DOWN, PF, ...) are engaged through sctp_assoc_control_transport().

Commits 4c47af4d5e ("net: sctp: rework multihoming retransmission
path selection to rfc4960") and a7288c4dd5 ("net: sctp: improve
sctp_select_active_and_retran_path selection") have both improved
it towards a more fine-grained and optimal path selection.

Currently, the selection algorithm for T.ACT and T.RET is as follows:

1) Elect the two most recently used ACTIVE transports T1, T2 for
   T.ACT, T.RET, where T.ACT<-T1 and T1 is most recently used
2) In case primary path T.PRI not in {T1, T2} but ACTIVE, set
   T.ACT<-T.PRI and T.RET<-T1
3) If only T1 is ACTIVE from the set, set T.ACT<-T1 and T.RET<-T1
4) If none is ACTIVE, set T.ACT<-best(T.PRI, T.RET, T3) where
   T3 is the most recently used (if avail) in PF, set T.RET<-T.PRI

Prior to above commits, 4) was simply a camp on T.ACT<-T.PRI and
T.RET<-T.PRI, ignoring possible paths in PF. Camping on T.PRI is
still slightly suboptimal as it can lead to the following scenario:

Setup:
        <A>                                <B>
    T1: p1p1 (10.0.10.10) <==>  .'`)  <==> p1p1 (10.0.10.12)  <= T.PRI
    T2: p1p2 (10.0.10.20) <==> (_ . ) <==> p1p2 (10.0.10.22)

    net.sctp.rto_min = 1000
    net.sctp.path_max_retrans = 2
    net.sctp.pf_retrans = 0
    net.sctp.hb_interval = 1000

T.PRI is permanently down, T2 is put briefly into PF state (e.g. due to
link flapping). Here, the first time transmission is sent over PF path
T2 as it's the only non-INACTIVE path, but the retransmitted data-chunks
are sent over the INACTIVE path T1 (T.PRI), which is not good.

After the patch, it's choosing better transports in both cases by
modifying step 4):

4) If none is ACTIVE, set T.ACT_new<-best(T.ACT_old, T3) where T3 is
   the most recently used (if avail) in PF, set T.RET<-T.ACT_new

This will still select a best possible path in PF if available (which
can also include T.PRI/T.RET), and set both T.ACT/T.RET to it.

In case sctp_assoc_control_transport() *just* put T.ACT_old into INACTIVE
as it transitioned from ACTIVE->PF->INACTIVE and stays in INACTIVE just
for a very short while before going back ACTIVE, it will guarantee that
this path will be reselected for T.ACT/T.RET since T3 (PF) is not
available.

Previously, this was not possible, as we would only select between T.PRI
and T.RET, and a possible T3 would be NULL due to the fact that we have
just transitioned T3 in sctp_assoc_control_transport() from PF->INACTIVE
and would select a suboptimal path when T.PRI/T.RET have worse properties.

In the case that T.ACT_old permanently went to INACTIVE during this
transition and there's no PF path available, plus T.PRI and T.RET are
INACTIVE as well, we would now camp on T.ACT_old, but if everything is
being INACTIVE there's really not much we can do except hoping for a
successful HB to bring one of the transports back up again and, thus
cause a new selection through sctp_assoc_control_transport().

Now both tests work fine:

Case 1:

 1. T1 S(ACTIVE) T.ACT
    T2 S(ACTIVE) T.RET

 2. T1 S(ACTIVE) T.ACT, T.RET
    T2 S(PF)

 3. T1 S(ACTIVE) T.ACT, T.RET
    T2 S(INACTIVE)

 5. T1 S(PF) T.ACT, T.RET
    T2 S(INACTIVE)

[ 5.1 T1 S(INACTIVE) T.ACT, T.RET
      T2 S(INACTIVE) ]

 6. T1 S(ACTIVE) T.ACT, T.RET
    T2 S(INACTIVE)

 7. T1 S(ACTIVE) T.ACT
    T2 S(ACTIVE) T.RET

Case 2:

 1. T1 S(ACTIVE) T.ACT
    T2 S(ACTIVE) T.RET

 2. T1 S(PF)
    T2 S(ACTIVE) T.ACT, T.RET

 3. T1 S(INACTIVE)
    T2 S(ACTIVE) T.ACT, T.RET

 5. T1 S(INACTIVE)
    T2 S(PF) T.ACT, T.RET

[ 5.1 T1 S(INACTIVE)
      T2 S(INACTIVE) T.ACT, T.RET ]

 6. T1 S(INACTIVE)
    T2 S(ACTIVE) T.ACT, T.RET

 7. T1 S(ACTIVE) T.ACT
    T2 S(ACTIVE) T.RET

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-22 11:31:30 -07:00
Daniel Borkmann ea4f19c1f8 net: sctp: spare unnecessary comparison in sctp_trans_elect_best
When both transports are the same, we don't have to go down that
road only to realize that we will return the very same transport.
We are guaranteed that curr is always non-NULL. Therefore, just
short-circuit this special case.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-22 11:31:30 -07:00
Jiri Benc 2ba5af42a7 openvswitch: fix panic with multiple vlan headers
When there are multiple vlan headers present in a received frame, the first
one is put into vlan_tci and protocol is set to ETH_P_8021Q. Anything in the
skb beyond the VLAN TPID may be still non-linear, including the inner TCI
and ethertype. While ovs_flow_extract takes care of IP and IPv6 headers, it
does nothing with ETH_P_8021Q. Later, if OVS_ACTION_ATTR_POP_VLAN is
executed, __pop_vlan_tci pulls the next vlan header into vlan_tci.

This leads to two things:

1. Part of the resulting ethernet header is in the non-linear part of the
   skb. When eth_type_trans is called later as the result of
   OVS_ACTION_ATTR_OUTPUT, kernel BUGs in __skb_pull. Also, __pop_vlan_tci
   is in fact accessing random data when it reads past the TPID.

2. network_header points into the ethernet header instead of behind it.
   mac_len is set to a wrong value (10), too.

Reported-by: Yulong Pei <ypei@redhat.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-22 11:24:04 -07:00
Benjamin Block 793c3b4000 net: ipv6: fib: don't sleep inside atomic lock
The function fib6_commit_metrics() allocates a piece of memory in mode
GFP_KERNEL while holding an atomic lock from higher up in the stack, in
the function __ip6_ins_rt(). This produces the following BUG:

> BUG: sleeping function called from invalid context at mm/slub.c:1250
> in_atomic(): 1, irqs_disabled(): 0, pid: 2909, name: dhcpcd
> 2 locks held by dhcpcd/2909:
>  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff81978e67>] rtnl_lock+0x17/0x20
>  #1:  (&tb->tb6_lock){++--+.}, at: [<ffffffff81a6951a>] ip6_route_add+0x65a/0x800
> CPU: 1 PID: 2909 Comm: dhcpcd Not tainted 3.17.0-rc1 #1
> Hardware name: ASUS All Series/Q87T, BIOS 0216 10/16/2013
>  0000000000000008 ffff8800c8f13858 ffffffff81af135a 0000000000000000
>  ffff880212202430 ffff8800c8f13878 ffffffff810f8d3a ffff880212202c98
>  0000000000000010 ffff8800c8f138c8 ffffffff8121ad0e 0000000000000001
> Call Trace:
>  [<ffffffff81af135a>] dump_stack+0x4e/0x68
>  [<ffffffff810f8d3a>] __might_sleep+0x10a/0x120
>  [<ffffffff8121ad0e>] kmem_cache_alloc_trace+0x4e/0x190
>  [<ffffffff81a6bcd6>] ? fib6_commit_metrics+0x66/0x110
>  [<ffffffff81a6bcd6>] fib6_commit_metrics+0x66/0x110
>  [<ffffffff81a6cbf3>] fib6_add+0x883/0xa80
>  [<ffffffff81a6951a>] ? ip6_route_add+0x65a/0x800
>  [<ffffffff81a69535>] ip6_route_add+0x675/0x800
>  [<ffffffff81a68f2a>] ? ip6_route_add+0x6a/0x800
>  [<ffffffff81a6990c>] inet6_rtm_newroute+0x5c/0x80
>  [<ffffffff8197cf01>] rtnetlink_rcv_msg+0x211/0x260
>  [<ffffffff81978e67>] ? rtnl_lock+0x17/0x20
>  [<ffffffff81119708>] ? lock_release_holdtime+0x28/0x180
>  [<ffffffff81978e67>] ? rtnl_lock+0x17/0x20
>  [<ffffffff8197ccf0>] ? __rtnl_unlock+0x20/0x20
>  [<ffffffff819a989e>] netlink_rcv_skb+0x6e/0xd0
>  [<ffffffff81978ee5>] rtnetlink_rcv+0x25/0x40
>  [<ffffffff819a8e59>] netlink_unicast+0xd9/0x180
>  [<ffffffff819a9600>] netlink_sendmsg+0x700/0x770
>  [<ffffffff81103735>] ? local_clock+0x25/0x30
>  [<ffffffff8194e83c>] sock_sendmsg+0x6c/0x90
>  [<ffffffff811f98e3>] ? might_fault+0xa3/0xb0
>  [<ffffffff8195ca6d>] ? verify_iovec+0x7d/0xf0
>  [<ffffffff8194ec3e>] ___sys_sendmsg+0x37e/0x3b0
>  [<ffffffff8111ef15>] ? trace_hardirqs_on_caller+0x185/0x220
>  [<ffffffff81af979e>] ? mutex_unlock+0xe/0x10
>  [<ffffffff819a55ec>] ? netlink_insert+0xbc/0xe0
>  [<ffffffff819a65e5>] ? netlink_autobind.isra.30+0x125/0x150
>  [<ffffffff819a6520>] ? netlink_autobind.isra.30+0x60/0x150
>  [<ffffffff819a84f9>] ? netlink_bind+0x159/0x230
>  [<ffffffff811f989a>] ? might_fault+0x5a/0xb0
>  [<ffffffff8194f25e>] ? SYSC_bind+0x7e/0xd0
>  [<ffffffff8194f8cd>] __sys_sendmsg+0x4d/0x80
>  [<ffffffff8194f912>] SyS_sendmsg+0x12/0x20
>  [<ffffffff81afc692>] system_call_fastpath+0x16/0x1b

Fixing this by replacing the mode GFP_KERNEL with GFP_ATOMIC.

Signed-off-by: Benjamin Block <bebl@mageta.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-22 10:54:49 -07:00
zhuyj 061079ac0b sctp: not send SCTP_PEER_ADDR_CHANGE notifications with failed probe
Since the transport has always been in state SCTP_UNCONFIRMED, it
therefore wasn't active before and hasn't been used before, and it
always has been, so it is unnecessary to bug the user with a
notification.

Reported-by: Deepak Khandelwal <khandelwal.deepak.1987@gmail.com>
Suggested-by: Vlad Yasevich <vyasevich@gmail.com>
Suggested-by: Michael Tuexen <tuexen@fh-muenster.de>
Suggested-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Zhu Yanjun <Yanjun.Zhu@windriver.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-21 21:33:17 -07:00
Eric Dumazet dc808110bb packet: handle too big packets for PACKET_V3
af_packet can currently overwrite kernel memory by out of bound
accesses, because it assumed a [new] block can always hold one frame.

This is not generally the case, even if most existing tools do it right.

This patch clamps too long frames as API permits, and issue a one time
error on syslog.

[  394.357639] tpacket_rcv: packet too big, clamped from 5042 to 3966. macoff=82

In this example, packet header tp_snaplen was set to 3966,
and tp_len was set to 5042 (skb->len)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: f6fb8f100b ("af-packet: TPACKET_V3 flexible buffer implementation.")
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-21 16:44:28 -07:00
chas williams - CONTRACTOR 6df378d2d1 lec: Use rtnl lock/unlock when updating MTU
The LECS response contains the MTU that should be used.  Correctly
synchronize with other layers when updating.

Signed-off-by: Chas Williams - CONTRACTOR <chas@cmf.nrl.navy.mil>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-21 16:31:23 -07:00
David S. Miller 02784f1b05 tipc: Fix build.
Missing semicolon in range check fix.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-19 11:16:38 -07:00
Vasily Averin 7201c1ddf7 cbq: now_rt removal
Now q->now_rt is identical to q->now and is not required anymore.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-19 10:58:44 -07:00
Vasily Averin 73d0f37ac4 cbq: incorrectly low bandwidth setting blocks limited traffic
Mainstream commit f0f6ee1f70 ("cbq: incorrect processing of high limits")
have side effect: if cbq bandwidth setting is less than real interface
throughput non-limited traffic can delay limited traffic for a very long time.

This happen because of q->now changes incorrectly in cbq_dequeue():
in described scenario L2T is much greater than real time delay,
and q->now gets an extra boost for each transmitted packet.

Accumulated boost prevents update q->now, and blocked class can wait
very long time until (q->now >= cl->undertime) will be true again.

To fix the problem the patch updates q->now on each cbq_update() call.
L2T-related pre-modification q->now was moved to cbq_update().

My testing confirmed that it fixes the problem and did not discover
any side-effects

Fixes: f0f6ee1f70 ("cbq: incorrect processing of high limits")

Signed-off-by: Vasily Averin <vvs@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-19 10:58:44 -07:00
Erik Hugne ac32c7f705 tipc: fix message importance range check
Commit 3b4f302d85 ("tipc: eliminate
redundant locking") introduced a bug by removing the sanity check
for message importance, allowing programs to assign any value to
the msg_user field. This will mess up the packet reception logic
and may cause random link resets.

Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-16 20:17:34 -07:00
Sven Eckelmann e050dbeb0d batman-adv: Fix parameter order of hlist_add_behind
1d023284c3 ("list: fix order of arguments for
hlist_add_after(_rcu)") was incorrectly rebased on top of
d9124268d8 ("batman-adv: Fix out-of-order
fragmentation support"). The parameter order change of the rebased patch was
not re-applied as expected. This causes a memory leak and can cause crashes
when out-of-order packets are received. hlist_add_behind will try to access the
uninitalized list pointers of frag_entry_new to find the previous/next entry
and may modify/read random memory locations.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-16 19:19:08 -07:00
Thomas Graf 9ce12eb16f netlink: Annotate RCU locking for seq_file walker
Silences the following sparse warnings:
net/netlink/af_netlink.c:2926:21: warning: context imbalance in 'netlink_seq_start' - wrong count at exit
net/netlink/af_netlink.c:2972:13: warning: context imbalance in 'netlink_seq_stop' - unexpected unlock

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-14 15:13:40 -07:00
Neal Cardwell 0c9ab09223 tcp: fix ssthresh and undo for consecutive short FRTO episodes
Fix TCP FRTO logic so that it always notices when snd_una advances,
indicating that any RTO after that point will be a new and distinct
loss episode.

Previously there was a very specific sequence that could cause FRTO to
fail to notice a new loss episode had started:

(1) RTO timer fires, enter FRTO and retransmit packet 1 in write queue
(2) receiver ACKs packet 1
(3) FRTO sends 2 more packets
(4) RTO timer fires again (should start a new loss episode)

The problem was in step (3) above, where tcp_process_loss() returned
early (in the spot marked "Step 2.b"), so that it never got to the
logic to clear icsk_retransmits. Thus icsk_retransmits stayed
non-zero. Thus in step (4) tcp_enter_loss() would see the non-zero
icsk_retransmits, decide that this RTO is not a new episode, and
decide not to cut ssthresh and remember the current cwnd and ssthresh
for undo.

There were two main consequences to the bug that we have
observed. First, ssthresh was not decreased in step (4). Second, when
there was a series of such FRTO (1-4) sequences that happened to be
followed by an FRTO undo, we would restore the cwnd and ssthresh from
before the entire series started (instead of the cwnd and ssthresh
from before the most recent RTO). This could result in cwnd and
ssthresh being restored to values much bigger than the proper values.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Fixes: e33099f96d ("tcp: implement RFC5682 F-RTO")
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-14 14:38:55 -07:00
Hannes Frederic Sowa a26552afe8 tcp: don't allow syn packets without timestamps to pass tcp_tw_recycle logic
tcp_tw_recycle heavily relies on tcp timestamps to build a per-host
ordering of incoming connections and teardowns without the need to
hold state on a specific quadruple for TCP_TIMEWAIT_LEN, but only for
the last measured RTO. To do so, we keep the last seen timestamp in a
per-host indexed data structure and verify if the incoming timestamp
in a connection request is strictly greater than the saved one during
last connection teardown. Thus we can verify later on that no old data
packets will be accepted by the new connection.

During moving a socket to time-wait state we already verify if timestamps
where seen on a connection. Only if that was the case we let the
time-wait socket expire after the RTO, otherwise normal TCP_TIMEWAIT_LEN
will be used. But we don't verify this on incoming SYN packets. If a
connection teardown was less than TCP_PAWS_MSL seconds in the past we
cannot guarantee to not accept data packets from an old connection if
no timestamps are present. We should drop this SYN packet. This patch
closes this loophole.

Please note, this patch does not make tcp_tw_recycle in any way more
usable but only adds another safety check:
Sporadic drops of SYN packets because of reordering in the network or
in the socket backlog queues can happen. Users behing NAT trying to
connect to a tcp_tw_recycle enabled server can get caught in blackholes
and their connection requests may regullary get dropped because hosts
behind an address translator don't have synchronized tcp timestamp clocks.
tcp_tw_recycle cannot work if peers don't have tcp timestamps enabled.

In general, use of tcp_tw_recycle is disadvised.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-14 14:38:54 -07:00
Neal Cardwell 4fab907195 tcp: fix tcp_release_cb() to dispatch via address family for mtu_reduced()
Make sure we use the correct address-family-specific function for
handling MTU reductions from within tcp_release_cb().

Previously AF_INET6 sockets were incorrectly always using the IPv6
code path when sometimes they were handling IPv4 traffic and thus had
an IPv4 dst.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Diagnosed-by: Willem de Bruijn <willemb@google.com>
Fixes: 563d34d057 ("tcp: dont drop MTU reduction indications")
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-14 14:38:54 -07:00
Shmulik Ladkani bc8fc7b8f8 sit: Fix ipip6_tunnel_lookup device matching criteria
As of 4fddbf5d78 ("sit: strictly restrict incoming traffic to tunnel link device"),
when looking up a tunnel, tunnel's underlying interface (t->parms.link)
is verified to match incoming traffic's ingress device.

However the comparison was incorrectly based on skb->dev->iflink.

Instead, dev->ifindex should be used, which correctly represents the
interface from which the IP stack hands the ipip6 packets.

This allows setting up sit tunnels bound to vlan interfaces (otherwise
incoming ipip6 traffic on the vlan interface was dropped due to
ipip6_tunnel_lookup match failure).

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-14 14:38:54 -07:00
Andrey Vagin 9d186cac7f tcp: don't use timestamp from repaired skb-s to calculate RTT (v2)
We don't know right timestamp for repaired skb-s. Wrong RTT estimations
isn't good, because some congestion modules heavily depends on it.

This patch adds the TCPCB_REPAIRED flag, which is included in
TCPCB_RETRANS.

Thanks to Eric for the advice how to fix this issue.

This patch fixes the warning:
[  879.562947] WARNING: CPU: 0 PID: 2825 at net/ipv4/tcp_input.c:3078 tcp_ack+0x11f5/0x1380()
[  879.567253] CPU: 0 PID: 2825 Comm: socket-tcpbuf-l Not tainted 3.16.0-next-20140811 #1
[  879.567829] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  879.568177]  0000000000000000 00000000c532680c ffff880039643d00 ffffffff817aa2d2
[  879.568776]  0000000000000000 ffff880039643d38 ffffffff8109afbd ffff880039d6ba80
[  879.569386]  ffff88003a449800 000000002983d6bd 0000000000000000 000000002983d6bc
[  879.569982] Call Trace:
[  879.570264]  [<ffffffff817aa2d2>] dump_stack+0x4d/0x66
[  879.570599]  [<ffffffff8109afbd>] warn_slowpath_common+0x7d/0xa0
[  879.570935]  [<ffffffff8109b0ea>] warn_slowpath_null+0x1a/0x20
[  879.571292]  [<ffffffff816d0a05>] tcp_ack+0x11f5/0x1380
[  879.571614]  [<ffffffff816d10bd>] tcp_rcv_established+0x1ed/0x710
[  879.571958]  [<ffffffff816dc9da>] tcp_v4_do_rcv+0x10a/0x370
[  879.572315]  [<ffffffff81657459>] release_sock+0x89/0x1d0
[  879.572642]  [<ffffffff816c81a0>] do_tcp_setsockopt.isra.36+0x120/0x860
[  879.573000]  [<ffffffff8110a52e>] ? rcu_read_lock_held+0x6e/0x80
[  879.573352]  [<ffffffff816c8912>] tcp_setsockopt+0x32/0x40
[  879.573678]  [<ffffffff81654ac4>] sock_common_setsockopt+0x14/0x20
[  879.574031]  [<ffffffff816537b0>] SyS_setsockopt+0x80/0xf0
[  879.574393]  [<ffffffff817b40a9>] system_call_fastpath+0x16/0x1b
[  879.574730] ---[ end trace a17cbc38eb8c5c00 ]---

v2: moving setting of skb->when for repaired skb-s in tcp_write_xmit,
    where it's set for other skb-s.

Fixes: 431a91242d ("tcp: timestamp SYN+DATA messages")
Fixes: 740b0f1841 ("tcp: switch rtt estimations to usec resolution")
Cc: Eric Dumazet <edumazet@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-14 14:38:54 -07:00
Willem de Bruijn 490cc7d03c net-timestamp: fix missing tcp fragmentation cases
Bytestream timestamps are correlated with a single byte in the skbuff,
recorded in skb_shinfo(skb)->tskey. When fragmenting skbuffs, ensure
that the tskey is set for the fragment in which the tskey falls
(seqno <= tskey < end_seqno).

The original implementation did not address fragmentation in
tcp_fragment or tso_fragment. Add code to inspect the sequence numbers
and move both tskey and the relevant tx_flags if necessary.

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-13 20:06:06 -07:00
Willem de Bruijn 712a72213f net-timestamp: fix missing ACK timestamp
ACK timestamps are generated in tcp_clean_rtx_queue. The TSO datapath
can break out early, causing the timestamp code to be skipped. Move
the code up before the break.

Reported-by: David S. Miller <davem@davemloft.net>

Also fix a boundary condition: tp->snd_una is the next unacknowledged
byte and between tests inclusive (a <= b <= c), so generate a an ACK
timestamp if (prior_snd_una <= tskey <= tp->snd_una - 1).

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-13 20:06:06 -07:00
Maks Naumov efd5029010 irda: Fix rd_frame control field initialization in irlap_send_rd_frame()
Signed-off-by: Maks Naumov <maksqwe1@ukr.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-13 20:05:52 -07:00
chas williams - CONTRACTOR 8356f9d564 lec: Fix bug introduced by b67bfe0d42
b67bfe0d42 (hlist: drop the node
parameter from iterators) dropped the node parameter from
iterators which lec_tbl_walk() was using to iterate the list.

Signed-off-by: Chas Williams <chas@cmf.nrl.navy.mil>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-13 20:04:46 -07:00
chas williams - CONTRACTOR de713b5794 atm/svc: Fix blocking in wait loop
One should not call blocking primitives inside a wait loop, since both
require task_struct::state to sleep, so the inner will destroy the
outer state.

sigd_enq() will possibly sleep for alloc_skb().  Move sigd_enq() before
prepare_to_wait() to avoid sleeping while waiting interruptibly.  You do
not actually need to call sigd_enq() after the initial prepare_to_wait()
because we test the termination condition before calling schedule().

Based on suggestions from Peter Zijlstra.

Signed-off-by: Chas Williams <chas@cmf.n4rl.navy.mil>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-13 20:04:46 -07:00
Christoph Jaeger 3791b3f6fb openvswitch: Fix memory leak in ovs_vport_alloc() error path
ovs_vport_alloc() bails out without freeing the memory 'vport' points to.

Picked up by Coverity - CID 1230503.

Fixes: 5cd667b0a4 ("openvswitch: Allow each vport to have an array of 'port_id's.")
Signed-off-by: Christoph Jaeger <cj@linux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-13 20:04:46 -07:00
Linus Torvalds f0094b28f3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Several networking final fixes and tidies for the merge window:

   1) Changes during the merge window unintentionally took away the
      ability to build bluetooth modular, fix from Geert Uytterhoeven.

   2) Several phy_node reference count bug fixes from Uwe Kleine-König.

   3) Fix ucc_geth build failures, also from Uwe Kleine-König.

   4) Fix klog false positivies when netlink messages go to network
      taps, by properly resetting the network header.  Fix from Daniel
      Borkmann.

   5) Sizing estimate of VF netlink messages is too small, from Jiri
      Benc.

   6) New APM X-Gene SoC ethernet driver, from Iyappan Subramanian.

   7) VLAN untagging is erroneously dependent upon whether the VLAN
      module is loaded or not, but there are generic dependencies that
      matter wrt what can be expected as the SKB enters the stack.
      Make the basic untagging generic code, and do it unconditionally.
      From Vlad Yasevich.

   8) xen-netfront only has so many slots in it's transmit queue so
      linearize packets that have too many frags.  From Zoltan Kiss.

   9) Fix suspend/resume PHY handling in bcmgenet driver, from Florian
      Fainelli"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (55 commits)
  net: bcmgenet: correctly resume adapter from Wake-on-LAN
  net: bcmgenet: update UMAC_CMD only when link is detected
  net: bcmgenet: correctly suspend and resume PHY device
  net: bcmgenet: request and enable main clock earlier
  net: ethernet: myricom: myri10ge: myri10ge.c: Cleaning up missing null-terminate after strncpy call
  xen-netfront: Fix handling packets on compound pages with skb_linearize
  net: fec: Support phys probed from devicetree and fixed-link
  smsc: replace WARN_ON() with WARN_ON_SMP()
  xen-netback: Don't deschedule NAPI when carrier off
  net: ethernet: qlogic: qlcnic: Remove duplicate object file from Makefile
  wan: wanxl: Remove typedefs from struct names
  m68k/atari: EtherNEC - ethernet support (ne)
  net: ethernet: ti: cpmac.c: Cleaning up missing null-terminate after strncpy call
  hdlc: Remove typedefs from struct names
  airo_cs: Remove typedef local_info_t
  atmel: Remove typedef atmel_priv_ioctl
  com20020_cs: Remove typedef com20020_dev_t
  ethernet: amd: Remove typedef local_info_t
  net: Always untag vlan-tagged traffic on input.
  drivers: net: Add APM X-Gene SoC ethernet driver support.
  ...
2014-08-13 18:27:40 -06:00
Linus Torvalds 06b8ab5528 NFS client updates for Linux 3.17
Highlights include:
 
 - Stable fix for a bug in nfs3_list_one_acl()
 - Speed up NFS path walks by supporting LOOKUP_RCU
 - More read/write code cleanups
 - pNFS fixes for layout return on close
 - Fixes for the RCU handling in the rpcsec_gss code
 - More NFS/RDMA fixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJT65zoAAoJEGcL54qWCgDyvq8QAJ+OKuC5dpngrZ13i4ZJIcK1
 TJSkWCr44FhYPlrmkLCntsGX6C0376oFEtJ5uqloqK0+/QtvwRNVSQMKaJopKIVY
 mR4En0WwpigxVQdW2lgto6bfOhzMVO+llVdmicEVrU8eeSThATxGNv7rxRzWorvL
 RX3TwBkWSc0kLtPi66VRFQ1z+gg5I0kngyyhsKnLOaHHtpTYP2JDZlRPRkokXPUg
 nmNedmC3JrFFkarroFIfYr54Qit2GW/eI2zVhOwHGCb45j4b2wntZ6wr7LpUdv3A
 OGDBzw59cTpcx3Hij9CFvLYVV9IJJHBNd2MJqdQRtgWFfs+aTkZdk4uilUJCIzZh
 f4BujQAlm/4X1HbPxsSvkCRKga7mesGM7e0sBDPHC1vu0mSaY1cakcj2kQLTpbQ7
 gqa1cR3pZ+4shCq37cLwWU0w1yElYe1c4otjSCttPCrAjXbXJZSFzYnHm8DwKROR
 t+yEDRL5BIXPu1nEtSnD2+xTQ3vUIYXooZWEmqLKgRtBTtPmgSn9Vd8P1OQXmMNo
 VJyFXyjNx5WH06Wbc/jLzQ1/cyhuPmJWWyWMJlVROyv+FXk9DJUFBZuTkpMrIPcF
 NlBXLV1GnA7PzMD9Xt9bwqteERZl6fOUDJLWS9P74kTk5c2kD+m+GaqC/rBTKKXc
 ivr2s7aIDV48jhnwBSVL
 =KE07
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.17-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

   - stable fix for a bug in nfs3_list_one_acl()
   - speed up NFS path walks by supporting LOOKUP_RCU
   - more read/write code cleanups
   - pNFS fixes for layout return on close
   - fixes for the RCU handling in the rpcsec_gss code
   - more NFS/RDMA fixes"

* tag 'nfs-for-3.17-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (79 commits)
  nfs: reject changes to resvport and sharecache during remount
  NFS: Avoid infinite loop when RELEASE_LOCKOWNER getting expired error
  SUNRPC: remove all refcounting of groupinfo from rpcauth_lookupcred
  NFS: fix two problems in lookup_revalidate in RCU-walk
  NFS: allow lockless access to access_cache
  NFS: teach nfs_lookup_verify_inode to handle LOOKUP_RCU
  NFS: teach nfs_neg_need_reval to understand LOOKUP_RCU
  NFS: support RCU_WALK in nfs_permission()
  sunrpc/auth: allow lockless (rcu) lookup of credential cache.
  NFS: prepare for RCU-walk support but pushing tests later in code.
  NFS: nfs4_lookup_revalidate: only evaluate parent if it will be used.
  NFS: add checks for returned value of try_module_get()
  nfs: clear_request_commit while holding i_lock
  pnfs: add pnfs_put_lseg_async
  pnfs: find swapped pages on pnfs commit lists too
  nfs: fix comment and add warn_on for PG_INODE_REF
  nfs: check wait_on_bit_lock err in page_group_lock
  sunrpc: remove "ec" argument from encrypt_v2 operation
  sunrpc: clean up sparse endianness warnings in gss_krb5_wrap.c
  sunrpc: clean up sparse endianness warnings in gss_krb5_seal.c
  ...
2014-08-13 18:13:19 -06:00
Linus Torvalds 8d2d441ac4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph updates from Sage Weil:
 "There is a lot of refactoring and hardening of the libceph and rbd
  code here from Ilya that fix various smaller bugs, and a few more
  important fixes with clone overlap.  The main fix is a critical change
  to the request_fn handling to not sleep that was exposed by the recent
  mutex changes (which will also go to the 3.16 stable series).

  Yan Zheng has several fixes in here for CephFS fixing ACL handling,
  time stamps, and request resends when the MDS restarts.

  Finally, there are a few cleanups from Himangi Saraogi based on
  Coccinelle"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (39 commits)
  libceph: set last_piece in ceph_msg_data_pages_cursor_init() correctly
  rbd: remove extra newlines from rbd_warn() messages
  rbd: allocate img_request with GFP_NOIO instead GFP_ATOMIC
  rbd: rework rbd_request_fn()
  ceph: fix kick_requests()
  ceph: fix append mode write
  ceph: fix sizeof(struct tYpO *) typo
  ceph: remove redundant memset(0)
  rbd: take snap_id into account when reading in parent info
  rbd: do not read in parent info before snap context
  rbd: update mapping size only on refresh
  rbd: harden rbd_dev_refresh() and callers a bit
  rbd: split rbd_dev_spec_update() into two functions
  rbd: remove unnecessary asserts in rbd_dev_image_probe()
  rbd: introduce rbd_dev_header_info()
  rbd: show the entire chain of parent images
  ceph: replace comma with a semicolon
  rbd: use rbd_segment_name_free() instead of kfree()
  ceph: check zero length in ceph_sync_read()
  ceph: reset r_resend_mds after receiving -ESTALE
  ...
2014-08-13 17:43:29 -06:00
Vlad Yasevich 0d5501c1c8 net: Always untag vlan-tagged traffic on input.
Currently the functionality to untag traffic on input resides
as part of the vlan module and is build only when VLAN support
is enabled in the kernel.  When VLAN is disabled, the function
vlan_untag() turns into a stub and doesn't really untag the
packets.  This seems to create an interesting interaction
between VMs supporting checksum offloading and some network drivers.

There are some drivers that do not allow the user to change
tx-vlan-offload feature of the driver.  These drivers also seem
to assume that any VLAN-tagged traffic they transmit will
have the vlan information in the vlan_tci and not in the vlan
header already in the skb.  When transmitting skbs that already
have tagged data with partial checksum set, the checksum doesn't
appear to be updated correctly by the card thus resulting in a
failure to establish TCP connections.

The following is a packet trace taken on the receiver where a
sender is a VM with a VLAN configued.  The host VM is running on
doest not have VLAN support and the outging interface on the
host is tg3:
10:12:43.503055 52:54:00:ae:42:3f > 28:d2:44:7d:c2:de, ethertype 802.1Q
(0x8100), length 78: vlan 100, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 27243,
offset 0, flags [DF], proto TCP (6), length 60)
    10.0.100.1.58545 > 10.0.100.10.ircu-2: Flags [S], cksum 0xdc39 (incorrect
-> 0x48d9), seq 1069378582, win 29200, options [mss 1460,sackOK,TS val
4294837885 ecr 0,nop,wscale 7], length 0
10:12:44.505556 52:54:00:ae:42:3f > 28:d2:44:7d:c2:de, ethertype 802.1Q
(0x8100), length 78: vlan 100, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 27244,
offset 0, flags [DF], proto TCP (6), length 60)
    10.0.100.1.58545 > 10.0.100.10.ircu-2: Flags [S], cksum 0xdc39 (incorrect
-> 0x44ee), seq 1069378582, win 29200, options [mss 1460,sackOK,TS val
4294838888 ecr 0,nop,wscale 7], length 0

This connection finally times out.

I've only access to the TG3 hardware in this configuration thus have
only tested this with TG3 driver.  There are a lot of other drivers
that do not permit user changes to vlan acceleration features, and
I don't know if they all suffere from a similar issue.

The patch attempt to fix this another way.  It moves the vlan header
stipping code out of the vlan module and always builds it into the
kernel network core.  This way, even if vlan is not supported on
a virtualizatoin host, the virtual machines running on top of such
host will still work with VLANs enabled.

CC: Patrick McHardy <kaber@trash.net>
CC: Nithin Nayak Sujir <nsujir@broadcom.com>
CC: Michael Chan <mchan@broadcom.com>
CC: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-11 12:16:51 -07:00
David S. Miller f00439e2e3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains fixes for your net tree, they are:

1) Unitialize the set element key and data from the commit path,
   otherwise this leaks chain refcount if the transaction is aborted,
   reported by Thomas Graf.

2) Fix crash when updating chains without no counters in nf_tables,
   this slipped through in the new transaction infrastructure, reported
   by Matteo Croce.

3) Replace all mutex_lock_interruptible() by mutex_lock() in the Netfilter
   tree, suggested by Patrick McHardy. This implicitly fixes the problem
   that Eric Dumazet reported in: http://patchwork.ozlabs.org/patch/373076/

4) Fix error return code in nf_tables when deleting set element in
   nf_tables if the transaction cannot be allocated, from Julia Lawall.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-11 10:26:15 -07:00
Linus Torvalds 77e40aae76 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace updates from Eric Biederman:
 "This is a bunch of small changes built against 3.16-rc6.  The most
  significant change for users is the first patch which makes setns
  drmatically faster by removing unneded rcu handling.

  The next chunk of changes are so that "mount -o remount,.." will not
  allow the user namespace root to drop flags on a mount set by the
  system wide root.  Aks this forces read-only mounts to stay read-only,
  no-dev mounts to stay no-dev, no-suid mounts to stay no-suid, no-exec
  mounts to stay no exec and it prevents unprivileged users from messing
  with a mounts atime settings.  I have included my test case as the
  last patch in this series so people performing backports can verify
  this change works correctly.

  The next change fixes a bug in NFS that was discovered while auditing
  nsproxy users for the first optimization.  Today you can oops the
  kernel by reading /proc/fs/nfsfs/{servers,volumes} if you are clever
  with pid namespaces.  I rebased and fixed the build of the
  !CONFIG_NFS_FS case yesterday when a build bot caught my typo.  Given
  that no one to my knowledge bases anything on my tree fixing the typo
  in place seems more responsible that requiring a typo-fix to be
  backported as well.

  The last change is a small semantic cleanup introducing
  /proc/thread-self and pointing /proc/mounts and /proc/net at it.  This
  prevents several kinds of problemantic corner cases.  It is a
  user-visible change so it has a minute chance of causing regressions
  so the change to /proc/mounts and /proc/net are individual one line
  commits that can be trivially reverted.  Unfortunately I lost and
  could not find the email of the original reporter so he is not
  credited.  From at least one perspective this change to /proc/net is a
  refgression fix to allow pthread /proc/net uses that were broken by
  the introduction of the network namespace"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  proc: Point /proc/mounts at /proc/thread-self/mounts instead of /proc/self/mounts
  proc: Point /proc/net at /proc/thread-self/net instead of /proc/self/net
  proc: Implement /proc/thread-self to point at the directory of the current thread
  proc: Have net show up under /proc/<tgid>/task/<tid>
  NFS: Fix /proc/fs/nfsfs/servers and /proc/fs/nfsfs/volumes
  mnt: Add tests for unprivileged remount cases that have found to be faulty
  mnt: Change the default remount atime from relatime to the existing value
  mnt: Correct permission checks in do_remount
  mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount
  mnt: Only change user settable mount flags in remount
  namespaces: Use task_lock and not rcu to protect nsproxy
2014-08-09 17:10:41 -07:00
Linus Torvalds 0d10c2c170 Merge branch 'for-3.17' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
 "This includes a major rewrite of the NFSv4 state code, which has
  always depended on a single mutex.  As an example, open creates are no
  longer serialized, fixing a performance regression on NFSv3->NFSv4
  upgrades.  Thanks to Jeff, Trond, and Benny, and to Christoph for
  review.

  Also some RDMA fixes from Chuck Lever and Steve Wise, and
  miscellaneous fixes from Kinglong Mee and others"

* 'for-3.17' of git://linux-nfs.org/~bfields/linux: (167 commits)
  svcrdma: remove rdma_create_qp() failure recovery logic
  nfsd: add some comments to the nfsd4 object definitions
  nfsd: remove the client_mutex and the nfs4_lock/unlock_state wrappers
  nfsd: remove nfs4_lock_state: nfs4_state_shutdown_net
  nfsd: remove nfs4_lock_state: nfs4_laundromat
  nfsd: Remove nfs4_lock_state(): reclaim_complete()
  nfsd: Remove nfs4_lock_state(): setclientid, setclientid_confirm, renew
  nfsd: Remove nfs4_lock_state(): exchange_id, create/destroy_session()
  nfsd: Remove nfs4_lock_state(): nfsd4_open and nfsd4_open_confirm
  nfsd: Remove nfs4_lock_state(): nfsd4_delegreturn()
  nfsd: Remove nfs4_lock_state(): nfsd4_open_downgrade + nfsd4_close
  nfsd: Remove nfs4_lock_state(): nfsd4_lock/locku/lockt()
  nfsd: Remove nfs4_lock_state(): nfsd4_release_lockowner
  nfsd: Remove nfs4_lock_state(): nfsd4_test_stateid/nfsd4_free_stateid
  nfsd: Remove nfs4_lock_state(): nfs4_preprocess_stateid_op()
  nfsd: remove old fault injection infrastructure
  nfsd: add more granular locking to *_delegations fault injectors
  nfsd: add more granular locking to forget_openowners fault injector
  nfsd: add more granular locking to forget_locks fault injector
  nfsd: add a list_head arg to nfsd_foreach_client_lock
  ...
2014-08-09 14:31:18 -07:00
Ilya Dryomov 5f740d7e15 libceph: set last_piece in ceph_msg_data_pages_cursor_init() correctly
Determining ->last_piece based on the value of ->page_offset + length
is incorrect because length here is the length of the entire message.
->last_piece set to false even if page array data item length is <=
PAGE_SIZE, which results in invalid length passed to
ceph_tcp_{send,recv}page() and causes various asserts to fire.

    # cat pages-cursor-init.sh
    #!/bin/bash
    rbd create --size 10 --image-format 2 foo
    FOO_DEV=$(rbd map foo)
    dd if=/dev/urandom of=$FOO_DEV bs=1M &>/dev/null
    rbd snap create foo@snap
    rbd snap protect foo@snap
    rbd clone foo@snap bar
    # rbd_resize calls librbd rbd_resize(), size is in bytes
    ./rbd_resize bar $(((4 << 20) + 512))
    rbd resize --size 10 bar
    BAR_DEV=$(rbd map bar)
    # trigger a 512-byte copyup -- 512-byte page array data item
    dd if=/dev/urandom of=$BAR_DEV bs=1M count=1 seek=5

The problem exists only in ceph_msg_data_pages_cursor_init(),
ceph_msg_data_pages_advance() does the right thing.  The size_t cast is
unnecessary.

Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-08-09 11:27:32 +04:00
Jiri Benc 945a36761f rtnetlink: fix VF info size
Commit 1d8faf48c7 ("net/core: Add VF link state control") added new
attribute to IFLA_VF_INFO group in rtnl_fill_ifinfo but did not adjust size
of the allocated memory in if_nlmsg_size/rtnl_vfinfo_size. As the result, we
may trigger warnings in rtnl_getlink and similar functions when many VF
links are enabled, as the information does not fit into the allocated skb.

Fixes: 1d8faf48c7 ("net/core: Add VF link state control")
Reported-by: Yulong Pei <ypei@redhat.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-08 10:28:09 -07:00
Niv Yehezkel b7a71b51ee ipv4: removed redundant conditional
Since fib_lookup cannot return ESRCH no longer,
checking for this error code is no longer neccesary.

Signed-off-by: Niv Yehezkel <executerx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-08 10:22:22 -07:00
Julia Lawall 609ccf0877 netfilter: nf_tables: fix error return code
Convert a zero return value on error to a negative one, as returned
elsewhere in the function.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
identifier ret; expression e1,e2;
@@
(
if (\(ret < 0\|ret != 0\))
 { ... return ret; }
|
ret = 0
)
... when != ret = e1
    when != &ret
*if(...)
{
  ... when != ret = e2
      when forall
 return ret;
}
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-08-08 16:47:29 +02:00
Pablo Neira Ayuso 7926dbfa4b netfilter: don't use mutex_lock_interruptible()
Eric Dumazet reports that getsockopt() or setsockopt() sometimes
returns -EINTR instead of -ENOPROTOOPT, causing headaches to
application developers.

This patch replaces all the mutex_lock_interruptible() by mutex_lock()
in the netfilter tree, as there is no reason we should sleep for a
long time there.

Reported-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Julian Anastasov <ja@ssi.bg>
2014-08-08 16:47:23 +02:00
Pablo Neira Ayuso b88825de85 netfilter: nf_tables: don't update chain with unset counters
Fix possible replacement of the per-cpu chain counters by null
pointer when updating an existing chain in the commit path.

Reported-by: Matteo Croce <technoboy85@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-08-08 15:38:50 +02:00
Pablo Neira Ayuso a3716e70e1 netfilter: nf_tables: uninitialize element key/data from the commit path
This should happen once the element has been effectively released in
the commit path, not before. This fixes a possible chain refcount leak
if the transaction is aborted.

Reported-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-08-08 15:38:46 +02:00
Daniel Borkmann 4e48ed883c netlink: reset network header before passing to taps
netlink doesn't set any network header offset thus when the skb is
being passed to tap devices via dev_queue_xmit_nit(), it emits klog
false positives due to it being unset like:

  ...
  [  124.990397] protocol 0000 is buggy, dev nlmon0
  [  124.990411] protocol 0000 is buggy, dev nlmon0
  ...

So just reset the network header before passing to the device; for
packet sockets that just means nothing will change - mac and net
offset hold the same value just as before.

Reported-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-07 16:02:58 -07:00
Jean Sacren 0a4dd0d786 batman: fix duplicate #include of multicast.h
The header multicast.h was included twice, so delete one of them.

Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Cc: Marek Lindner <mareklindner@neomailbox.ch>
Cc: Simon Wunderlich <sw@simonwunderlich.de>
Cc: Antonio Quartulli <antonio@meshcoding.com>
Cc: b.a.t.m.a.n@lists.open-mesh.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-07 16:02:57 -07:00
Jean Sacren 2072ec846a openvswitch: fix duplicate #include headers
The #include headers net/genetlink.h and linux/genetlink.h both were
included twice, so delete each of the duplicate.

Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Cc: Pravin Shelar <pshelar@nicira.com>
Cc: dev@openvswitch.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-07 16:02:57 -07:00
Geert Uytterhoeven 2d177f3113 6lowpan: Allow 6LoWPAN to be modular
Change config symbol 6LOWPAN from type bool to type tristate, so
6LoWPAN can be built modular, just like IPV6

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-07 11:44:18 -07:00
Linus Torvalds 33caee3992 Merge branch 'akpm' (patchbomb from Andrew Morton)
Merge incoming from Andrew Morton:
 - Various misc things.
 - arch/sh updates.
 - Part of ocfs2.  Review is slow.
 - Slab updates.
 - Most of -mm.
 - printk updates.
 - lib/ updates.
 - checkpatch updates.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (226 commits)
  checkpatch: update $declaration_macros, add uninitialized_var
  checkpatch: warn on missing spaces in broken up quoted
  checkpatch: fix false positives for --strict "space after cast" test
  checkpatch: fix false positive MISSING_BREAK warnings with --file
  checkpatch: add test for native c90 types in unusual order
  checkpatch: add signed generic types
  checkpatch: add short int to c variable types
  checkpatch: add for_each tests to indentation and brace tests
  checkpatch: fix brace style misuses of else and while
  checkpatch: add --fix option for a couple OPEN_BRACE misuses
  checkpatch: use the correct indentation for which()
  checkpatch: add fix_insert_line and fix_delete_line helpers
  checkpatch: add ability to insert and delete lines to patch/file
  checkpatch: add an index variable for fixed lines
  checkpatch: warn on break after goto or return with same tab indentation
  checkpatch: emit a warning on file add/move/delete
  checkpatch: add test for commit id formatting style in commit log
  checkpatch: emit fewer kmalloc_array/kcalloc conversion warnings
  checkpatch: improve "no space after cast" test
  checkpatch: allow multiple const * types
  ...
2014-08-06 21:14:42 -07:00
Thomas Graf 6c8f7e7083 netlink: hold nl_sock_hash_lock during diag dump
Although RCU protection would be possible during diag dump, doing
so allows for concurrent table mutations which can render the
in-table offset between individual Netlink messages invalid and
thus cause legitimate sockets to be skipped in the dump.

Since the diag dump is relatively low volume and consistency is
more important than performance, the table mutex is held during
dump.

Reported-by: Andrey Wagin <avagin@gmail.com>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Fixes: e341694e3e ("netlink: Convert netlink_lookup() to use RCU protected hash table")
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-06 19:17:44 -07:00
Ken Helias 1d023284c3 list: fix order of arguments for hlist_add_after(_rcu)
All other add functions for lists have the new item as first argument
and the position where it is added as second argument.  This was changed
for no good reason in this function and makes using it unnecessary
confusing.

The name was changed to hlist_add_behind() to cause unconverted code to
generate a compile error instead of using the wrong parameter order.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Ken Helias <kenhelias@firemail.de>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	[intel driver bits]
Cc: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:24 -07:00
Dmitry Popov 9ea88a1530 tcp: md5: check md5 signature without socket lock
Since a8afca032 (tcp: md5: protects md5sig_info with RCU) tcp_md5_do_lookup
doesn't require socket lock, rcu_read_lock is enough. Therefore socket lock is
no longer required for tcp_v{4,6}_inbound_md5_hash too, so we can move these
calls (wrapped with rcu_read_{,un}lock) before bh_lock_sock:
from tcp_v{4,6}_do_rcv to tcp_v{4,6}_rcv.

Signed-off-by: Dmitry Popov <ixaphire@qrator.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-06 16:00:20 -07:00
Willem de Bruijn f066e2b091 net-timestamp: cumulative tcp timestamping fixes
A set of small fixes pointed out just after the merge:
- make tcp_tx_timestamp static
- make tcp_gso_tstamp static
- use before() to compare TCP seqno, instead of cast to u64
- add tstamp to tx_flags in GSO, instead of overwrite tx_flags
- record skb_shinfo(skb)->tskey for all timestamps, also HW.
- optimization in tcp_tx_timestamp:
  call sock_tx_timestamp only if a tstamp option is set.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Fixes: 4ed2d765df ("net-timestamp: TCP timestamping")
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-06 14:09:01 -07:00
Eric Dumazet 140c55d4b5 net-timestamp: sock_tx_timestamp() fix
sock_tx_timestamp() should not ignore initial *tx_flags value, as TCP
stack can store SKBTX_SHARED_FRAG in it.

Also first argument (struct sock *) can be const.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 4ed2d765df ("net-timestamp: TCP timestamping")
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-06 12:38:07 -07:00
Linus Torvalds ae045e2455 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Steady transitioning of the BPF instructure to a generic spot so
      all kernel subsystems can make use of it, from Alexei Starovoitov.

   2) SFC driver supports busy polling, from Alexandre Rames.

   3) Take advantage of hash table in UDP multicast delivery, from David
      Held.

   4) Lighten locking, in particular by getting rid of the LRU lists, in
      inet frag handling.  From Florian Westphal.

   5) Add support for various RFC6458 control messages in SCTP, from
      Geir Ola Vaagland.

   6) Allow to filter bridge forwarding database dumps by device, from
      Jamal Hadi Salim.

   7) virtio-net also now supports busy polling, from Jason Wang.

   8) Some low level optimization tweaks in pktgen from Jesper Dangaard
      Brouer.

   9) Add support for ipv6 address generation modes, so that userland
      can have some input into the process.  From Jiri Pirko.

  10) Consolidate common TCP connection request code in ipv4 and ipv6,
      from Octavian Purdila.

  11) New ARP packet logger in netfilter, from Pablo Neira Ayuso.

  12) Generic resizable RCU hash table, with intial users in netlink and
      nftables.  From Thomas Graf.

  13) Maintain a name assignment type so that userspace can see where a
      network device name came from (enumerated by kernel, assigned
      explicitly by userspace, etc.) From Tom Gundersen.

  14) Automatic flow label generation on transmit in ipv6, from Tom
      Herbert.

  15) New packet timestamping facilities from Willem de Bruijn, meant to
      assist in measuring latencies going into/out-of the packet
      scheduler, latency from TCP data transmission to ACK, etc"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1536 commits)
  cxgb4 : Disable recursive mailbox commands when enabling vi
  net: reduce USB network driver config options.
  tg3: Modify tg3_tso_bug() to handle multiple TX rings
  amd-xgbe: Perform phy connect/disconnect at dev open/stop
  amd-xgbe: Use dma_set_mask_and_coherent to set DMA mask
  net: sun4i-emac: fix memory leak on bad packet
  sctp: fix possible seqlock seadlock in sctp_packet_transmit()
  Revert "net: phy: Set the driver when registering an MDIO bus device"
  cxgb4vf: Turn off SGE RX/TX Callback Timers and interrupts in PCI shutdown routine
  team: Simplify return path of team_newlink
  bridge: Update outdated comment on promiscuous mode
  net-timestamp: ACK timestamp for bytestreams
  net-timestamp: TCP timestamping
  net-timestamp: SCHED timestamp on entering packet scheduler
  net-timestamp: add key to disambiguate concurrent datagrams
  net-timestamp: move timestamp flags out of sk_flags
  net-timestamp: extend SCM_TIMESTAMPING ancillary data struct
  cxgb4i : Move stray CPL definitions to cxgb4 driver
  tcp: reduce spurious retransmits due to transient SACK reneging
  qlcnic: Initialize dcbnl_ops before register_netdev
  ...
2014-08-06 09:38:14 -07:00
Linus Torvalds bb2cbf5e93 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:
 "In this release:

   - PKCS#7 parser for the key management subsystem from David Howells
   - appoint Kees Cook as seccomp maintainer
   - bugfixes and general maintenance across the subsystem"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (94 commits)
  X.509: Need to export x509_request_asymmetric_key()
  netlabel: shorter names for the NetLabel catmap funcs/structs
  netlabel: fix the catmap walking functions
  netlabel: fix the horribly broken catmap functions
  netlabel: fix a problem when setting bits below the previously lowest bit
  PKCS#7: X.509 certificate issuer and subject are mandatory fields in the ASN.1
  tpm: simplify code by using %*phN specifier
  tpm: Provide a generic means to override the chip returned timeouts
  tpm: missing tpm_chip_put in tpm_get_random()
  tpm: Properly clean sysfs entries in error path
  tpm: Add missing tpm_do_selftest to ST33 I2C driver
  PKCS#7: Use x509_request_asymmetric_key()
  Revert "selinux: fix the default socket labeling in sock_graft()"
  X.509: x509_request_asymmetric_keys() doesn't need string length arguments
  PKCS#7: fix sparse non static symbol warning
  KEYS: revert encrypted key change
  ima: add support for measuring and appraising firmware
  firmware_class: perform new LSM checks
  security: introduce kernel_fw_from_file hook
  PKCS#7: Missing inclusion of linux/err.h
  ...
2014-08-06 08:06:39 -07:00
David S. Miller d247b6ab3c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/Makefile
	net/ipv6/sysctl_net_ipv6.c

Two ipv6_table_template[] additions overlap, so the index
of the ipv6_table[x] assignments needed to be adjusted.

In the drivers/net/Makefile case, we've gotten rid of the
garbage whereby we had to list every single USB networking
driver in the top-level Makefile, there is just one
"USB_NETWORKING" that guards everything.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 18:46:26 -07:00
Eric Dumazet 757efd32d5 sctp: fix possible seqlock seadlock in sctp_packet_transmit()
Dave reported following splat, caused by improper use of
IP_INC_STATS_BH() in process context.

BUG: using __this_cpu_add() in preemptible [00000000] code: trinity-c117/14551
caller is __this_cpu_preempt_check+0x13/0x20
CPU: 3 PID: 14551 Comm: trinity-c117 Not tainted 3.16.0+ #33
 ffffffff9ec898f0 0000000047ea7e23 ffff88022d32f7f0 ffffffff9e7ee207
 0000000000000003 ffff88022d32f818 ffffffff9e397eaa ffff88023ee70b40
 ffff88022d32f970 ffff8801c026d580 ffff88022d32f828 ffffffff9e397ee3
Call Trace:
 [<ffffffff9e7ee207>] dump_stack+0x4e/0x7a
 [<ffffffff9e397eaa>] check_preemption_disabled+0xfa/0x100
 [<ffffffff9e397ee3>] __this_cpu_preempt_check+0x13/0x20
 [<ffffffffc0839872>] sctp_packet_transmit+0x692/0x710 [sctp]
 [<ffffffffc082a7f2>] sctp_outq_flush+0x2a2/0xc30 [sctp]
 [<ffffffff9e0d985c>] ? mark_held_locks+0x7c/0xb0
 [<ffffffff9e7f8c6d>] ? _raw_spin_unlock_irqrestore+0x5d/0x80
 [<ffffffffc082b99a>] sctp_outq_uncork+0x1a/0x20 [sctp]
 [<ffffffffc081e112>] sctp_cmd_interpreter.isra.23+0x1142/0x13f0 [sctp]
 [<ffffffffc081c86b>] sctp_do_sm+0xdb/0x330 [sctp]
 [<ffffffff9e0b8f1b>] ? preempt_count_sub+0xab/0x100
 [<ffffffffc083b350>] ? sctp_cname+0x70/0x70 [sctp]
 [<ffffffffc08389ca>] sctp_primitive_ASSOCIATE+0x3a/0x50 [sctp]
 [<ffffffffc083358f>] sctp_sendmsg+0x88f/0xe30 [sctp]
 [<ffffffff9e0d673a>] ? lock_release_holdtime.part.28+0x9a/0x160
 [<ffffffff9e0d62ce>] ? put_lock_stats.isra.27+0xe/0x30
 [<ffffffff9e73b624>] inet_sendmsg+0x104/0x220
 [<ffffffff9e73b525>] ? inet_sendmsg+0x5/0x220
 [<ffffffff9e68ac4e>] sock_sendmsg+0x9e/0xe0
 [<ffffffff9e1c0c09>] ? might_fault+0xb9/0xc0
 [<ffffffff9e1c0bae>] ? might_fault+0x5e/0xc0
 [<ffffffff9e68b234>] SYSC_sendto+0x124/0x1c0
 [<ffffffff9e0136b0>] ? syscall_trace_enter+0x250/0x330
 [<ffffffff9e68c3ce>] SyS_sendto+0xe/0x10
 [<ffffffff9e7f9be4>] tracesys+0xdd/0xe2

This is a followup of commits f1d8cba61c ("inet: fix possible
seqlock deadlocks") and 7f88c6b23a ("ipv6: fix possible seqlock
deadlock in ip6_finish_output2")

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reported-by: Dave Jones <davej@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 16:42:55 -07:00
David S. Miller 6ff4e36f8b Included changes:
- kmalloc_array instead of kmalloc when possible
 - avoid log spam due to useless net_ratelimit() invocations
 - increase default metric hop penalty from 15 to 30
 - update internal version number
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJT4Ir0AAoJEJgn97Bh2u9eYG8P/2Bx9uCbajNQSCBvQ20OXW4o
 YzubqMETLel6GLoVjL2XFsmJATaQGGWYWGSytDF2sRa1C7/dXXfUrwgfFNzw2L9Z
 nZSd8WbhiBW0LK/ECzTQ12M3VnTdjgj3IP9A58N9zqQf5uM1onQR6gR7IQGARn9S
 D5onFdoRz5bWcVffyf4YN9irMkvAFjOB7OzNtBaHRbH5oObITS10LcQ7rVLAic/0
 lyMCX1ioBnFpbH1YfII0dcSFjY22m9QTjJDj4dx6LbjDXaxv9BMiJ7bfjCt5wn1o
 0ju5X898fhTX0L4Z67pGGHzawByyXtrf1r9INot8K8oqftq5vkHpIJdn9n0GrYGa
 9FVoQ0hzDVSp7mxnBKnbXyaf47HX8RYRBZegHEDW8zrpO0zj7cyruRKifDL3q9R9
 cqGIRYSFPaXEGF6HfmxWXCjst1VpddcTBgD7OWlKuTi/Juk2ZoWI4O+h3WlotOv8
 niTiG0DiNoGlrLXjFlya6TdfL43zg3sE8O4oukcVO4uejo+wdL+Rc7ZGuO7CbO9k
 V+ORQ/8Efg3XD8xRSmzi0pgyBfU/C6HTTWCIv8DIMfl0B5zoeIx75quDP9/CKIfz
 UTF0R1+mA0o6UmSiIngqRTFEEB9cWrMvCjM1drMb1V99NzI+/fg1skO7pcnJRddC
 rbPgeTHDAYgMsPj74rhs
 =EIGl
 -----END PGP SIGNATURE-----

Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge

Antonio Quartulli says:

====================
pull request: batman-adv 2014-08-05

this is a pull request intended for net-next/linux-3.17 (yeah..it's really
late).

Patches 1, 2 and 4 are really minor changes:
- kmalloc_array is substituted to kmalloc when possible (as suggested by
  checkpatch);
- net_ratelimited() is now used properly and the "suppressed" message is not
  printed anymore if not needed;
- the internal version number has been increased to reflect our current version.

Patch 3 instead is introducing a change in the metric computation function
by changing the penalty applied at each mesh hop from 15/255 (~6%) to
30/255 (~11%). This change is introduced by Simon Wunderlich after having
observed a performance improvement in several networks when using the new value.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 16:39:47 -07:00
Toshiaki Makita fdb0a6626e bridge: Update outdated comment on promiscuous mode
Now bridge ports can be non-promiscuous, vlan_vid_add() is no longer an
unnecessary operation.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 16:37:10 -07:00
Willem de Bruijn e1c8a607b2 net-timestamp: ACK timestamp for bytestreams
Add SOF_TIMESTAMPING_TX_ACK, a request for a tstamp when the last byte
in the send() call is acknowledged. It implements the feature for TCP.

The timestamp is generated when the TCP socket cumulative ACK is moved
beyond the tracked seqno for the first time. The feature ignores SACK
and FACK, because those acknowledge the specific byte, but not
necessarily the entire contents of the buffer up to that byte.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 16:35:54 -07:00
Willem de Bruijn 4ed2d765df net-timestamp: TCP timestamping
TCP timestamping extends SO_TIMESTAMPING to bytestreams.

Bytestreams do not have a 1:1 relationship between send() buffers and
network packets. The feature interprets a send call on a bytestream as
a request for a timestamp for the last byte in that send() buffer.

The choice corresponds to a request for a timestamp when all bytes in
the buffer have been sent. That assumption depends on in-order kernel
transmission. This is the common case. That said, it is possible to
construct a traffic shaping tree that would result in reordering.
The guarantee is strong, then, but not ironclad.

This implementation supports send and sendpages (splice). GSO replaces
one large packet with multiple smaller packets. This patch also copies
the option into the correct smaller packet.

This patch does not yet support timestamping on data in an initial TCP
Fast Open SYN, because that takes a very different data path.

If ID generation in ee_data is enabled, bytestream timestamps return a
byte offset, instead of the packet counter for datagrams.

The implementation supports a single timestamp per packet. It silenty
replaces requests for previous timestamps. To avoid missing tstamps,
flush the tcp queue by disabling Nagle, cork and autocork. Missing
tstamps can be detected by offset when the ee_data ID is enabled.

Implementation details:

- On GSO, the timestamping code can be included in the main loop. I
moved it into its own loop to reduce the impact on the common case
to a single branch.

- To avoid leaking the absolute seqno to userspace, the offset
returned in ee_data must always be relative. It is an offset between
an skb and sk field. The first is always set (also for GSO & ACK).
The second must also never be uninitialized. Only allow the ID
option on sockets in the ESTABLISHED state, for which the seqno
is available. Never reset it to zero (instead, move it to the
current seqno when reenabling the option).

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 16:35:54 -07:00
Willem de Bruijn e7fd288538 net-timestamp: SCHED timestamp on entering packet scheduler
Kernel transmit latency is often incurred in the packet scheduler.
Introduce a new timestamp on transmission just before entering the
scheduler. When data travels through multiple devices (bonding,
tunneling, ...) each device will export an individual timestamp.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 16:35:54 -07:00
Willem de Bruijn 09c2d251b7 net-timestamp: add key to disambiguate concurrent datagrams
Datagrams timestamped on transmission can coexist in the kernel stack
and be reordered in packet scheduling. When reading looped datagrams
from the socket error queue it is not always possible to unique
correlate looped data with original send() call (for application
level retransmits). Even if possible, it may be expensive and complex,
requiring packet inspection.

Introduce a data-independent ID mechanism to associate timestamps with
send calls. Pass an ID alongside the timestamp in field ee_data of
sock_extended_err.

The ID is a simple 32 bit unsigned int that is associated with the
socket and incremented on each send() call for which software tx
timestamp generation is enabled.

The feature is enabled only if SOF_TIMESTAMPING_OPT_ID is set, to
avoid changing ee_data for existing applications that expect it 0.
The counter is reset each time the flag is reenabled. Reenabling
does not change the ID of already submitted data. It is possible
to receive out of order IDs if the timestamp stream is not quiesced
first.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 16:35:54 -07:00
Willem de Bruijn b9f40e21ef net-timestamp: move timestamp flags out of sk_flags
sk_flags is reaching its limit. New timestamping options will not fit.
Move all of them into a new field sk->sk_tsflags.

Added benefit is that this removes boilerplate code to convert between
SOF_TIMESTAMPING_.. and SOCK_TIMESTAMPING_.. in getsockopt/setsockopt.

SOCK_TIMESTAMPING_RX_SOFTWARE is also used to toggle the receive
timestamp logic (netstamp_needed). That can be simplified and this
last key removed, but will leave that for a separate patch.

Signed-off-by: Willem de Bruijn <willemb@google.com>

----

The u16 in sock can be moved into a 16-bit hole below sk_gso_max_segs,
though that scatters tstamp fields throughout the struct.
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 16:35:54 -07:00
Willem de Bruijn f24b9be595 net-timestamp: extend SCM_TIMESTAMPING ancillary data struct
Applications that request kernel tx timestamps with SO_TIMESTAMPING
read timestamps as recvmsg() ancillary data. The response is defined
implicitly as timespec[3].

1) define struct scm_timestamping explicitly and

2) add support for new tstamp types. On tx, scm_timestamping always
   accompanies a sock_extended_err. Define previously unused field
   ee_info to signal the type of ts[0]. Introduce SCM_TSTAMP_SND to
   define the existing behavior.

The reception path is not modified. On rx, no struct similar to
sock_extended_err is passed along with SCM_TIMESTAMPING.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 16:35:53 -07:00
Neal Cardwell 5ae344c949 tcp: reduce spurious retransmits due to transient SACK reneging
This commit reduces spurious retransmits due to apparent SACK reneging
by only reacting to SACK reneging that persists for a short delay.

When a sequence space hole at snd_una is filled, some TCP receivers
send a series of ACKs as they apparently scan their out-of-order queue
and cumulatively ACK all the packets that have now been consecutiveyly
received. This is essentially misbehavior B in "Misbehaviors in TCP
SACK generation" ACM SIGCOMM Computer Communication Review, April
2011, so we suspect that this is from several common OSes (Windows
2000, Windows Server 2003, Windows XP). However, this issue has also
been seen in other cases, e.g. the netdev thread "TCP being hoodwinked
into spurious retransmissions by lack of timestamps?" from March 2014,
where the receiver was thought to be a BSD box.

Since snd_una would temporarily be adjacent to a previously SACKed
range in these scenarios, this receiver behavior triggered the Linux
SACK reneging code path in the sender. This led the sender to clear
the SACK scoreboard, enter CA_Loss, and spuriously retransmit
(potentially) every packet from the entire write queue at line rate
just a few milliseconds before the ACK for each packet arrives at the
sender.

To avoid such situations, now when a sender sees apparent reneging it
does not yet retransmit, but rather adjusts the RTO timer to give the
receiver a little time (max(RTT/2, 10ms)) to send us some more ACKs
that will restore sanity to the SACK scoreboard. If the reneging
persists until this RTO then, as before, we clear the SACK scoreboard
and enter CA_Loss.

A 10ms delay tolerates a receiver sending such a stream of ACKs at
56Kbit/sec. And to allow for receivers with slower or more congested
paths, we wait for at least RTT/2.

We validated the resulting max(RTT/2, 10ms) delay formula with a mix
of North American and South American Google web server traffic, and
found that for ACKs displaying transient reneging:

 (1) 90% of inter-ACK delays were less than 10ms
 (2) 99% of inter-ACK delays were less than RTT/2

In tests on Google web servers this commit reduced reneging events by
75%-90% (as measured by the TcpExtTCPSACKReneging counter), without
any measurable impact on latency for user HTTP and SPDY requests.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 16:29:33 -07:00
David S. Miller aef4f5b6db Merge tag 'master-2014-07-31' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next
Conflicts:
	net/6lowpan/iphc.c

Minor conflicts in iphc.c were changes overlapping with some
style cleanups.

John W. Linville says:

====================
Please pull this last(?) batch of wireless change intended for the
3.17 stream...

For the NFC bits, Samuel says:

"This is a rather quiet one, we have:

- A new driver from ST Microelectronics for their NCI ST21NFCB,
  including device tree  support.

- p2p support for the ST21NFCA driver

- A few fixes an enhancements for the NFC digital laye"

For the Atheros bits, Kalle says:

"Michal and Janusz did some important RX aggregation fixes, basically we
were missing RX reordering altogether. The 10.1 firmware doesn't support
Ad-Hoc mode and Michal fixed ath10k so that it doesn't advertise Ad-Hoc
support with that firmware. Also he implemented a workaround for a KVM
issue."

For the Bluetooth bits, Gustavo and Johan say:

"To quote Gustavo from his previous request:

'Some last minute fixes for -next. We have a fix for a use after free in
RFCOMM, another fix to an issue with ADV_DIRECT_IND and one for ADV_IND with
auto-connection handling.  Last, we added support for reading the codec and
MWS setting for controllers that support these features.'

Additionally there are fixes to LE scanning, an update to conform to the 4.1
core specification as well as fixes for tracking the page scan state. All
of these fixes are important for 3.17."

And,

"We've got:

- 6lowpan fixes/cleanups
- A couple crash fixes, one for the Marvell HCI driver and another in LE SMP.
- Fix for an incorrect connected state check
- Fix for the bondable requirement during pairing (an issue which had
  crept in because of using "pairable" when in fact the actual meaning
  was "bondable" (these have different meanings in Bluetooth)"

Along with those are some late-breaking hardware support patches in
brcmfmac and b43 as well as a stray ath9k patch.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 13:18:20 -07:00
Steve Wise d1e458fe67 svcrdma: remove rdma_create_qp() failure recovery logic
In svc_rdma_accept(), if rdma_create_qp() fails, there is useless
logic to try and call rdma_create_qp() again with reduced sge depths.
The assumption, I guess, was that perhaps the initial sge depths
chosen were too big.  However they initial depths are selected based
on the rdma device attribute max_sge returned from ib_query_device().
If rdma_create_qp() fails, it would not be because the max_send_sge and
max_recv_sge values passed in exceed the device's max.  So just remove
this code.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-05 16:09:21 -04:00