Граф коммитов

9843 Коммитов

Автор SHA1 Сообщение Дата
Gerrit Renker 73f18fdbca dccp: Bug-Fix - AWL was never updated
The AWL lower Ack validity window advances in proportion to GSS, the greatest
sequence number sent. Updating AWL other than at connection setup (in the
DCCP-Request sent by dccp_v{4,6}_connect()) was missing in the DCCP code.

This bug lead to syslog messages such as

 "kernel: dccp_check_seqno: DCCP: Step 6 failed for DATAACK packet, [...] 
  P.ackno exists or LAWL(82947089) <= P.ackno(82948208)
                                   <= S.AWH(82948728), sending SYNC..."

The difference between AWL/AWH here is 1639 packets, while the expected value
(the Sequence Window) would have been 100 (the default).  A closer look showed
that LAWL = AWL = 82947089 equalled the ISS on the Response.

The patch now updates AWL with each increase of GSS.


Further changes:
----------------
The patch also enforces more stringent checks on the ISS sequence number:

 * AWL is initialised to ISS at connection setup and remains at this value;
 * AWH is then always set to GSS (via dccp_update_gss());
 * so on the first Request: AWL =      AWH = ISS,
   and on the n-th Request: AWL = ISS, AWH = ISS + n.

As a consequence, only Response packets that refer to Requests sent by this
host will pass, all others are discarded. This is the intention and in effect 
implements the initial adjustments for AWL as specified in RFC 4340, 7.5.1.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>   
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
2008-07-26 11:59:10 +01:00
Gerrit Renker 59435444a1 dccp: Allow to distinguish original and retransmitted packets
This patch allows the sender to distinguish original and retransmitted packets,
which is in particular needed for the retransmission of DCCP-Requests:
 * the first Request uses ISS (generated in net/dccp/ip*.c), and sets GSS = ISS;
 * all retransmitted Requests use GSS' = GSS + 1, so that the n-th retransmitted
   Request has sequence number ISS + n (mod 48).

To add generic support, the patch reorganises existing code so that:
 * icsk_retransmits == 0     for the original packet and
 * icsk_retransmits = n > 0  for the n-th retransmitted packet
at the time dccp_transmit_skb() is called, via dccp_retransmit_skb().
 
Thanks to Wei Yongjun for pointing this problem out.

Further changes:
----------------
 * removed the `skb' argument from dccp_retransmit_skb(), since sk_send_head
   is used for all retransmissions (the exception is client-Acks in PARTOPEN
   state, but these do not use sk_send_head);
 * since sk_send_head always contains the original skb (via dccp_entail()),
   skb_cloned() never evaluated to true and thus pskb_copy() was never used.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
2008-07-26 11:59:09 +01:00
David S. Miller cdec7e50a4 Revert "pkt_sched: sch_sfq: dump a real number of flows"
This reverts commit f867e6af94.

Based upon discussions between Jarek and Patrick McHardy
this is field being set is more a config parameter than a
statistic.  And we should add a true statistic to provide
this information if we really want it.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-26 02:28:09 -07:00
Florian Westphal 16df845f45 syncookies: Make sure ECN is disabled
ecn_ok is not initialized when a connection is established by cookies.
The cookie syn-ack never sets ECN, so ecn_ok must be set to 0.

Spotted using ns-3/network simulation cradle simulator and valgrind.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-26 02:21:54 -07:00
Ilpo Järvinen 547b792cac net: convert BUG_TRAP to generic WARN_ON
Removes legacy reinvent-the-wheel type thing. The generic
machinery integrates much better to automated debugging aids
such as kerneloops.org (and others), and is unambiguous due to
better naming. Non-intuively BUG_TRAP() is actually equal to
WARN_ON() rather than BUG_ON() though some might actually be
promoted to BUG_ON() but I left that to future.

I could make at least one BUILD_BUG_ON conversion.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-25 21:43:18 -07:00
Linus Torvalds 1ff8419871 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
  ipsec: ipcomp - Decompress into frags if necessary
  ipsec: ipcomp - Merge IPComp implementations
  pkt_sched: Fix locking in shutdown_scheduler_queue()
2008-07-25 17:40:16 -07:00
Stephen Hemminger 4ecb90090c sysctl: allow override of /proc/sys/net with CAP_NET_ADMIN
Extend the permission check for networking sysctl's to allow modification
when current process has CAP_NET_ADMIN capability and is not root.  This
version uses the until now unused permissions hook to override the mode
value for /proc/sys/net if accessed by a user with capabilities.

Found while working with Quagga.  It is impossible to turn forwarding
on/off through the command interface because Quagga uses secure coding
practice of dropping privledges during initialization and only raising via
capabilities when necessary.  Since the dameon has reset real/effective
uid after initialization, all attempts to access /proc/sys/net variables
will fail.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morgan <morgan@kernel.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:45 -07:00
Dave Young 717115e1a5 printk ratelimiting rewrite
All ratelimit user use same jiffies and burst params, so some messages
(callbacks) will be lost.

For example:
a call printk_ratelimit(5 * HZ, 1)
b call printk_ratelimit(5 * HZ, 1) before the 5*HZ timeout of a, then b will
will be supressed.

- rewrite __ratelimit, and use a ratelimit_state as parameter.  Thanks for
  hints from andrew.

- Add WARN_ON_RATELIMIT, update rcupreempt.h

- remove __printk_ratelimit

- use __ratelimit in net_ratelimit

Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:29 -07:00
Paul E. McKenney 696adfe84c list_for_each_rcu must die: networking
All uses of list_for_each_rcu() can be profitably replaced by the
easier-to-use list_for_each_entry_rcu().  This patch makes this change for
networking, in preparation for removing the list_for_each_rcu() API
entirely.

Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:27 -07:00
Herbert Xu 7d7e5a60c6 ipsec: ipcomp - Decompress into frags if necessary
When decompressing extremely large packets allocating them through
kmalloc is prone to failure.  Therefore it's better to use page
frags instead.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-25 02:55:33 -07:00
Herbert Xu 6fccab671f ipsec: ipcomp - Merge IPComp implementations
This patch merges the IPv4/IPv6 IPComp implementations since most
of the code is identical.  As a result future enhancements will no
longer need to be duplicated.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-25 02:54:40 -07:00
David S. Miller cffe1c5d7a pkt_sched: Fix locking in shutdown_scheduler_queue()
Qdisc locks need to be held with BH disabled.

Tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-25 01:25:04 -07:00
Linus Torvalds c3c2233d84 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
  pkt_sched: sch_sfq: dump a real number of flows
  atm: [fore200e] use MODULE_FIRMWARE() and other suggested cleanups
  netfilter: make security table depend on NETFILTER_ADVANCED
  tcp: Clear probes_out more aggressively in tcp_ack().
  e1000e: fix e1000_netpoll(), remove extraneous e1000_clean_tx_irq() call
  net: Update entry in af_family_clock_key_strings
  netdev: Remove warning from __netif_schedule().
  sky2: don't stop queue on shutdown
2008-07-24 12:14:58 -07:00
Ulrich Drepper e38b36f325 flag parameters: check magic constants
This patch adds test that ensure the boundary conditions for the various
constants introduced in the previous patches is met.  No code is generated.

[akpm@linux-foundation.org: fix alpha]
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:29 -07:00
Ulrich Drepper 77d2720059 flag parameters: NONBLOCK in socket and socketpair
This patch introduces support for the SOCK_NONBLOCK flag in socket,
socketpair, and  paccept.  To do this the internal function sock_attach_fd
gets an additional parameter which it uses to set the appropriate flag for
the file descriptor.

Given that in modern, scalable programs almost all socket connections are
non-blocking and the minimal additional cost for the new functionality
I see no reason not to add this code.

The following test must be adjusted for architectures other than x86 and
x86-64 and in case the syscall numbers changed.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/syscall.h>

#ifndef __NR_paccept
# ifdef __x86_64__
#  define __NR_paccept 288
# elif defined __i386__
#  define SYS_PACCEPT 18
#  define USE_SOCKETCALL 1
# else
#  error "need __NR_paccept"
# endif
#endif

#ifdef USE_SOCKETCALL
# define paccept(fd, addr, addrlen, mask, flags) \
  ({ long args[6] = { \
       (long) fd, (long) addr, (long) addrlen, (long) mask, 8, (long) flags }; \
     syscall (__NR_socketcall, SYS_PACCEPT, args); })
#else
# define paccept(fd, addr, addrlen, mask, flags) \
  syscall (__NR_paccept, fd, addr, addrlen, mask, 8, flags)
#endif

#define PORT 57392

#define SOCK_NONBLOCK O_NONBLOCK

static pthread_barrier_t b;

static void *
tf (void *arg)
{
  pthread_barrier_wait (&b);
  int s = socket (AF_INET, SOCK_STREAM, 0);
  struct sockaddr_in sin;
  sin.sin_family = AF_INET;
  sin.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
  sin.sin_port = htons (PORT);
  connect (s, (const struct sockaddr *) &sin, sizeof (sin));
  close (s);
  pthread_barrier_wait (&b);

  pthread_barrier_wait (&b);
  s = socket (AF_INET, SOCK_STREAM, 0);
  sin.sin_port = htons (PORT);
  connect (s, (const struct sockaddr *) &sin, sizeof (sin));
  close (s);
  pthread_barrier_wait (&b);

  return NULL;
}

int
main (void)
{
  int fd;
  fd = socket (PF_INET, SOCK_STREAM, 0);
  if (fd == -1)
    {
      puts ("socket(0) failed");
      return 1;
    }
  int fl = fcntl (fd, F_GETFL);
  if (fl == -1)
    {
      puts ("fcntl failed");
      return 1;
    }
  if (fl & O_NONBLOCK)
    {
      puts ("socket(0) set non-blocking mode");
      return 1;
    }
  close (fd);

  fd = socket (PF_INET, SOCK_STREAM|SOCK_NONBLOCK, 0);
  if (fd == -1)
    {
      puts ("socket(SOCK_NONBLOCK) failed");
      return 1;
    }
  fl = fcntl (fd, F_GETFL);
  if (fl == -1)
    {
      puts ("fcntl failed");
      return 1;
    }
  if ((fl & O_NONBLOCK) == 0)
    {
      puts ("socket(SOCK_NONBLOCK) does not set non-blocking mode");
      return 1;
    }
  close (fd);

  int fds[2];
  if (socketpair (PF_UNIX, SOCK_STREAM, 0, fds) == -1)
    {
      puts ("socketpair(0) failed");
      return 1;
    }
  for (int i = 0; i < 2; ++i)
    {
      fl = fcntl (fds[i], F_GETFL);
      if (fl == -1)
        {
          puts ("fcntl failed");
          return 1;
        }
      if (fl & O_NONBLOCK)
        {
          printf ("socketpair(0) set non-blocking mode for fds[%d]\n", i);
          return 1;
        }
      close (fds[i]);
    }

  if (socketpair (PF_UNIX, SOCK_STREAM|SOCK_NONBLOCK, 0, fds) == -1)
    {
      puts ("socketpair(SOCK_NONBLOCK) failed");
      return 1;
    }
  for (int i = 0; i < 2; ++i)
    {
      fl = fcntl (fds[i], F_GETFL);
      if (fl == -1)
        {
          puts ("fcntl failed");
          return 1;
        }
      if ((fl & O_NONBLOCK) == 0)
        {
          printf ("socketpair(SOCK_NONBLOCK) does not set non-blocking mode for fds[%d]\n", i);
          return 1;
        }
      close (fds[i]);
    }

  pthread_barrier_init (&b, NULL, 2);

  struct sockaddr_in sin;
  pthread_t th;
  if (pthread_create (&th, NULL, tf, NULL) != 0)
    {
      puts ("pthread_create failed");
      return 1;
    }

  int s = socket (AF_INET, SOCK_STREAM, 0);
  int reuse = 1;
  setsockopt (s, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof (reuse));
  sin.sin_family = AF_INET;
  sin.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
  sin.sin_port = htons (PORT);
  bind (s, (struct sockaddr *) &sin, sizeof (sin));
  listen (s, SOMAXCONN);

  pthread_barrier_wait (&b);

  int s2 = paccept (s, NULL, 0, NULL, 0);
  if (s2 < 0)
    {
      puts ("paccept(0) failed");
      return 1;
    }

  fl = fcntl (s2, F_GETFL);
  if (fl & O_NONBLOCK)
    {
      puts ("paccept(0) set non-blocking mode");
      return 1;
    }
  close (s2);
  close (s);

  pthread_barrier_wait (&b);

  s = socket (AF_INET, SOCK_STREAM, 0);
  sin.sin_port = htons (PORT);
  setsockopt (s, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof (reuse));
  bind (s, (struct sockaddr *) &sin, sizeof (sin));
  listen (s, SOMAXCONN);

  pthread_barrier_wait (&b);

  s2 = paccept (s, NULL, 0, NULL, SOCK_NONBLOCK);
  if (s2 < 0)
    {
      puts ("paccept(SOCK_NONBLOCK) failed");
      return 1;
    }

  fl = fcntl (s2, F_GETFL);
  if ((fl & O_NONBLOCK) == 0)
    {
      puts ("paccept(SOCK_NONBLOCK) does not set non-blocking mode");
      return 1;
    }
  close (s2);
  close (s);

  pthread_barrier_wait (&b);
  puts ("OK");

  return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:29 -07:00
Ulrich Drepper c019bbc612 flag parameters: paccept w/out set_restore_sigmask
Some platforms do not have support to restore the signal mask in the
return path from a syscall.  For those platforms syscalls like pselect are
not defined at all.  This is, I think, not a good choice for paccept()
since paccept() adds more value on top of accept() than just the signal
mask handling.

Therefore this patch defines a scaled down version of the sys_paccept
function for those platforms.  It returns -EINVAL in case the signal mask
is non-NULL but behaves the same otherwise.

Note that I explicitly included <linux/thread_info.h>.  I saw that it is
currently included but indirectly two levels down.  There is too much risk
in relying on this.  The header might change and then suddenly the
function definition would change without anyone immediately noticing.

Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:27 -07:00
Ulrich Drepper aaca0bdca5 flag parameters: paccept
This patch is by far the most complex in the series.  It adds a new syscall
paccept.  This syscall differs from accept in that it adds (at the userlevel)
two additional parameters:

- a signal mask
- a flags value

The flags parameter can be used to set flag like SOCK_CLOEXEC.  This is
imlpemented here as well.  Some people argued that this is a property which
should be inherited from the file desriptor for the server but this is against
POSIX.  Additionally, we really want the signal mask parameter as well
(similar to pselect, ppoll, etc).  So an interface change in inevitable.

The flag value is the same as for socket and socketpair.  I think diverging
here will only create confusion.  Similar to the filesystem interfaces where
the use of the O_* constants differs, it is acceptable here.

The signal mask is handled as for pselect etc.  The mask is temporarily
installed for the thread and removed before the call returns.  I modeled the
code after pselect.  If there is a problem it's likely also in pselect.

For architectures which use socketcall I maintained this interface instead of
adding a system call.  The symmetry shouldn't be broken.

The following test must be adjusted for architectures other than x86 and
x86-64 and in case the syscall numbers changed.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/syscall.h>

#ifndef __NR_paccept
# ifdef __x86_64__
#  define __NR_paccept 288
# elif defined __i386__
#  define SYS_PACCEPT 18
#  define USE_SOCKETCALL 1
# else
#  error "need __NR_paccept"
# endif
#endif

#ifdef USE_SOCKETCALL
# define paccept(fd, addr, addrlen, mask, flags) \
  ({ long args[6] = { \
       (long) fd, (long) addr, (long) addrlen, (long) mask, 8, (long) flags }; \
     syscall (__NR_socketcall, SYS_PACCEPT, args); })
#else
# define paccept(fd, addr, addrlen, mask, flags) \
  syscall (__NR_paccept, fd, addr, addrlen, mask, 8, flags)
#endif

#define PORT 57392

#define SOCK_CLOEXEC O_CLOEXEC

static pthread_barrier_t b;

static void *
tf (void *arg)
{
  pthread_barrier_wait (&b);
  int s = socket (AF_INET, SOCK_STREAM, 0);
  struct sockaddr_in sin;
  sin.sin_family = AF_INET;
  sin.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
  sin.sin_port = htons (PORT);
  connect (s, (const struct sockaddr *) &sin, sizeof (sin));
  close (s);

  pthread_barrier_wait (&b);
  s = socket (AF_INET, SOCK_STREAM, 0);
  sin.sin_port = htons (PORT);
  connect (s, (const struct sockaddr *) &sin, sizeof (sin));
  close (s);
  pthread_barrier_wait (&b);

  pthread_barrier_wait (&b);
  sleep (2);
  pthread_kill ((pthread_t) arg, SIGUSR1);

  return NULL;
}

static void
handler (int s)
{
}

int
main (void)
{
  pthread_barrier_init (&b, NULL, 2);

  struct sockaddr_in sin;
  pthread_t th;
  if (pthread_create (&th, NULL, tf, (void *) pthread_self ()) != 0)
    {
      puts ("pthread_create failed");
      return 1;
    }

  int s = socket (AF_INET, SOCK_STREAM, 0);
  int reuse = 1;
  setsockopt (s, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof (reuse));
  sin.sin_family = AF_INET;
  sin.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
  sin.sin_port = htons (PORT);
  bind (s, (struct sockaddr *) &sin, sizeof (sin));
  listen (s, SOMAXCONN);

  pthread_barrier_wait (&b);

  int s2 = paccept (s, NULL, 0, NULL, 0);
  if (s2 < 0)
    {
      puts ("paccept(0) failed");
      return 1;
    }

  int coe = fcntl (s2, F_GETFD);
  if (coe & FD_CLOEXEC)
    {
      puts ("paccept(0) set close-on-exec-flag");
      return 1;
    }
  close (s2);

  pthread_barrier_wait (&b);

  s2 = paccept (s, NULL, 0, NULL, SOCK_CLOEXEC);
  if (s2 < 0)
    {
      puts ("paccept(SOCK_CLOEXEC) failed");
      return 1;
    }

  coe = fcntl (s2, F_GETFD);
  if ((coe & FD_CLOEXEC) == 0)
    {
      puts ("paccept(SOCK_CLOEXEC) does not set close-on-exec flag");
      return 1;
    }
  close (s2);

  pthread_barrier_wait (&b);

  struct sigaction sa;
  sa.sa_handler = handler;
  sa.sa_flags = 0;
  sigemptyset (&sa.sa_mask);
  sigaction (SIGUSR1, &sa, NULL);

  sigset_t ss;
  pthread_sigmask (SIG_SETMASK, NULL, &ss);
  sigaddset (&ss, SIGUSR1);
  pthread_sigmask (SIG_SETMASK, &ss, NULL);

  sigdelset (&ss, SIGUSR1);
  alarm (4);
  pthread_barrier_wait (&b);

  errno = 0 ;
  s2 = paccept (s, NULL, 0, &ss, 0);
  if (s2 != -1 || errno != EINTR)
    {
      puts ("paccept did not fail with EINTR");
      return 1;
    }

  close (s);

  puts ("OK");

  return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[akpm@linux-foundation.org: make it compile]
[akpm@linux-foundation.org: add sys_ni stub]
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: <linux-arch@vger.kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Roland McGrath <roland@redhat.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:27 -07:00
Ulrich Drepper a677a039be flag parameters: socket and socketpair
This patch adds support for flag values which are ORed to the type passwd
to socket and socketpair.  The additional code is minimal.  The flag
values in this implementation can and must match the O_* flags.  This
avoids overhead in the conversion.

The internal functions sock_alloc_fd and sock_map_fd get a new parameters
and all callers are changed.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

#define PORT 57392

/* For Linux these must be the same.  */
#define SOCK_CLOEXEC O_CLOEXEC

int
main (void)
{
  int fd;
  fd = socket (PF_INET, SOCK_STREAM, 0);
  if (fd == -1)
    {
      puts ("socket(0) failed");
      return 1;
    }
  int coe = fcntl (fd, F_GETFD);
  if (coe == -1)
    {
      puts ("fcntl failed");
      return 1;
    }
  if (coe & FD_CLOEXEC)
    {
      puts ("socket(0) set close-on-exec flag");
      return 1;
    }
  close (fd);

  fd = socket (PF_INET, SOCK_STREAM|SOCK_CLOEXEC, 0);
  if (fd == -1)
    {
      puts ("socket(SOCK_CLOEXEC) failed");
      return 1;
    }
  coe = fcntl (fd, F_GETFD);
  if (coe == -1)
    {
      puts ("fcntl failed");
      return 1;
    }
  if ((coe & FD_CLOEXEC) == 0)
    {
      puts ("socket(SOCK_CLOEXEC) does not set close-on-exec flag");
      return 1;
    }
  close (fd);

  int fds[2];
  if (socketpair (PF_UNIX, SOCK_STREAM, 0, fds) == -1)
    {
      puts ("socketpair(0) failed");
      return 1;
    }
  for (int i = 0; i < 2; ++i)
    {
      coe = fcntl (fds[i], F_GETFD);
      if (coe == -1)
        {
          puts ("fcntl failed");
          return 1;
        }
      if (coe & FD_CLOEXEC)
        {
          printf ("socketpair(0) set close-on-exec flag for fds[%d]\n", i);
          return 1;
        }
      close (fds[i]);
    }

  if (socketpair (PF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0, fds) == -1)
    {
      puts ("socketpair(SOCK_CLOEXEC) failed");
      return 1;
    }
  for (int i = 0; i < 2; ++i)
    {
      coe = fcntl (fds[i], F_GETFD);
      if (coe == -1)
        {
          puts ("fcntl failed");
          return 1;
        }
      if ((coe & FD_CLOEXEC) == 0)
        {
          printf ("socketpair(SOCK_CLOEXEC) does not set close-on-exec flag for fds[%d]\n", i);
          return 1;
        }
      close (fds[i]);
    }

  puts ("OK");

  return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:27 -07:00
Jarek Poplawski f867e6af94 pkt_sched: sch_sfq: dump a real number of flows
Dump the "flows" number according to the number of active flows
instead of repeating the "limit".

Reported-by: Denys Fedoryshchenko <denys@visp.net.lb>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-23 21:34:27 -07:00
Linus Torvalds 26dcce0fab Merge branch 'cpus4096-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'cpus4096-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (31 commits)
  NR_CPUS: Replace NR_CPUS in speedstep-centrino.c
  cpumask: Provide a generic set of CPUMASK_ALLOC macros, FIXUP
  NR_CPUS: Replace NR_CPUS in cpufreq userspace routines
  NR_CPUS: Replace per_cpu(..., smp_processor_id()) with __get_cpu_var
  NR_CPUS: Replace NR_CPUS in arch/x86/kernel/genapic_flat_64.c
  NR_CPUS: Replace NR_CPUS in arch/x86/kernel/genx2apic_uv_x.c
  NR_CPUS: Replace NR_CPUS in arch/x86/kernel/cpu/proc.c
  NR_CPUS: Replace NR_CPUS in arch/x86/kernel/cpu/mcheck/mce_64.c
  cpumask: Optimize cpumask_of_cpu in lib/smp_processor_id.c, fix
  cpumask: Use optimized CPUMASK_ALLOC macros in the centrino_target
  cpumask: Provide a generic set of CPUMASK_ALLOC macros
  cpumask: Optimize cpumask_of_cpu in lib/smp_processor_id.c
  cpumask: Optimize cpumask_of_cpu in kernel/time/tick-common.c
  cpumask: Optimize cpumask_of_cpu in drivers/misc/sgi-xp/xpc_main.c
  cpumask: Optimize cpumask_of_cpu in arch/x86/kernel/ldt.c
  cpumask: Optimize cpumask_of_cpu in arch/x86/kernel/io_apic_64.c
  cpumask: Replace cpumask_of_cpu with cpumask_of_cpu_ptr
  Revert "cpumask: introduce new APIs"
  cpumask: make for_each_cpu_mask a bit smaller
  net: Pass reference to cpumask variable in net/sunrpc/svc.c
  ...

Fix up trivial conflicts in drivers/cpufreq/cpufreq.c manually
2008-07-23 18:37:44 -07:00
Patrick McHardy 70eed75d76 netfilter: make security table depend on NETFILTER_ADVANCED
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-23 16:42:42 -07:00
David S. Miller 4b53fb67e3 tcp: Clear probes_out more aggressively in tcp_ack().
This is based upon an excellent bug report from Eric Dumazet.

tcp_ack() should clear ->icsk_probes_out even if there are packets
outstanding.  Otherwise if we get a sequence of ACKs while we do have
packets outstanding over and over again, we'll never clear the
probes_out value and eventually think the connection is too sick and
we'll reset it.

This appears to be some "optimization" added to tcp_ack() in the 2.4.x
timeframe.  In 2.2.x, probes_out is pretty much always cleared by
tcp_ack().

Here is Eric's original report:

----------------------------------------
Apparently, we can in some situations reset TCP connections in a couple of seconds when some frames are lost.

In order to reproduce the problem, please try the following program on linux-2.6.25.*

Setup some iptables rules to allow two frames per second sent on loopback interface to tcp destination port 12000

iptables -N SLOWLO
iptables -A SLOWLO -m hashlimit --hashlimit 2 --hashlimit-burst 1 --hashlimit-mode dstip --hashlimit-name slow2 -j ACCEPT
iptables -A SLOWLO -j DROP

iptables -A OUTPUT -o lo -p tcp --dport 12000 -j SLOWLO

Then run the attached program and see the output :

# ./loop
State      Recv-Q Send-Q                                  Local Address:Port                                    Peer Address:Port
ESTAB      0      40                                          127.0.0.1:54455                                      127.0.0.1:12000  timer:(persist,200ms,1)
State      Recv-Q Send-Q                                  Local Address:Port                                    Peer Address:Port
ESTAB      0      40                                          127.0.0.1:54455                                      127.0.0.1:12000  timer:(persist,200ms,3)
State      Recv-Q Send-Q                                  Local Address:Port                                    Peer Address:Port
ESTAB      0      40                                          127.0.0.1:54455                                      127.0.0.1:12000  timer:(persist,200ms,5)
State      Recv-Q Send-Q                                  Local Address:Port                                    Peer Address:Port
ESTAB      0      40                                          127.0.0.1:54455                                      127.0.0.1:12000  timer:(persist,200ms,7)
State      Recv-Q Send-Q                                  Local Address:Port                                    Peer Address:Port
ESTAB      0      40                                          127.0.0.1:54455                                      127.0.0.1:12000  timer:(persist,200ms,9)
State      Recv-Q Send-Q                                  Local Address:Port                                    Peer Address:Port
ESTAB      0      40                                          127.0.0.1:54455                                      127.0.0.1:12000  timer:(persist,200ms,11)
State      Recv-Q Send-Q                                  Local Address:Port                                    Peer Address:Port
ESTAB      0      40                                          127.0.0.1:54455                                      127.0.0.1:12000  timer:(persist,201ms,13)
State      Recv-Q Send-Q                                  Local Address:Port                                    Peer Address:Port
ESTAB      0      40                                          127.0.0.1:54455                                      127.0.0.1:12000  timer:(persist,188ms,15)
write(): Connection timed out
wrote 890 bytes but was interrupted after 9 seconds
ESTAB      0      0                 127.0.0.1:12000            127.0.0.1:54455
Exiting read() because no data available (4000 ms timeout).
read 860 bytes

While this tcp session makes progress (sending frames with 50 bytes of payload, every 500ms), linux tcp stack decides to reset it, when tcp_retries 2 is reached (default value : 15)

tcpdump :

15:30:28.856695 IP 127.0.0.1.56554 > 127.0.0.1.12000: S 33788768:33788768(0) win 32792 <mss 16396,nop,nop,sackOK,nop,wscale 7>
15:30:28.856711 IP 127.0.0.1.12000 > 127.0.0.1.56554: S 33899253:33899253(0) ack 33788769 win 32792 <mss 16396,nop,nop,sackOK,nop,wscale 7>
15:30:29.356947 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 1:61(60) ack 1 win 257
15:30:29.356966 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 61 win 257
15:30:29.866415 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 61:111(50) ack 1 win 257
15:30:29.866427 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 111 win 257
15:30:30.366516 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 111:161(50) ack 1 win 257
15:30:30.366527 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 161 win 257
15:30:30.876196 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 161:211(50) ack 1 win 257
15:30:30.876207 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 211 win 257
15:30:31.376282 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 211:261(50) ack 1 win 257
15:30:31.376290 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 261 win 257
15:30:31.885619 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 261:311(50) ack 1 win 257
15:30:31.885631 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 311 win 257
15:30:32.385705 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 311:361(50) ack 1 win 257
15:30:32.385715 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 361 win 257
15:30:32.895249 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 361:411(50) ack 1 win 257
15:30:32.895266 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 411 win 257
15:30:33.395341 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 411:461(50) ack 1 win 257
15:30:33.395351 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 461 win 257
15:30:33.918085 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 461:511(50) ack 1 win 257
15:30:33.918096 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 511 win 257
15:30:34.418163 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 511:561(50) ack 1 win 257
15:30:34.418172 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 561 win 257
15:30:34.927685 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 561:611(50) ack 1 win 257
15:30:34.927698 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 611 win 257
15:30:35.427757 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 611:661(50) ack 1 win 257
15:30:35.427766 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 661 win 257
15:30:35.937359 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 661:711(50) ack 1 win 257
15:30:35.937376 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 711 win 257
15:30:36.437451 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 711:761(50) ack 1 win 257
15:30:36.437464 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 761 win 257
15:30:36.947022 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 761:811(50) ack 1 win 257
15:30:36.947039 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 811 win 257
15:30:37.447135 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 811:861(50) ack 1 win 257
15:30:37.447203 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 861 win 257
15:30:41.448171 IP 127.0.0.1.12000 > 127.0.0.1.56554: F 1:1(0) ack 861 win 257
15:30:41.448189 IP 127.0.0.1.56554 > 127.0.0.1.12000: R 33789629:33789629(0) win 0

Source of program :

/*
 * small producer/consumer program.
 * setup a listener on 127.0.0.1:12000
 * Forks a child
 *   child connect to 127.0.0.1, and sends 10 bytes on this tcp socket every 100 ms
 * Father accepts connection, and read all data
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>
#include <stdio.h>
#include <time.h>
#include <sys/poll.h>

int port = 12000;
char buffer[4096];
int main(int argc, char *argv[])
{
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in socket_address;
        time_t t0, t1;
        int on = 1, sfd, res;
        unsigned long total = 0;
        socklen_t alen = sizeof(socket_address);
        pid_t pid;

        time(&t0);
        socket_address.sin_family = AF_INET;
        socket_address.sin_port = htons(port);
        socket_address.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

        if (lfd == -1) {
                perror("socket()");
                return 1;
        }
        setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(int));
        if (bind(lfd, (struct sockaddr *)&socket_address, sizeof(socket_address)) == -1) {
                perror("bind");
                close(lfd);
                return 1;
        }
        if (listen(lfd, 1) == -1) {
                perror("listen()");
                close(lfd);
                return 1;
        }
        pid = fork();
        if (pid == 0) {
                int i, cfd = socket(AF_INET, SOCK_STREAM, 0);
                close(lfd);
                if (connect(cfd, (struct sockaddr *)&socket_address, sizeof(socket_address)) == -1) {
                        perror("connect()");
                        return 1;
                        }
                for (i = 0 ; ;) {
                        res = write(cfd, "blablabla\n", 10);
                        if (res > 0) total += res;
                        else if (res == -1) {
                                perror("write()");
                                break;
                        } else break;
                        usleep(100000);
                        if (++i == 10) {
                                system("ss -on dst 127.0.0.1:12000");
                                i = 0;
                        }
                }
                time(&t1);
                fprintf(stderr, "wrote %lu bytes but was interrupted after %g seconds\n", total, difftime(t1, t0));
                system("ss -on | grep 127.0.0.1:12000");
                close(cfd);
                return 0;
        }
        sfd = accept(lfd, (struct sockaddr *)&socket_address, &alen);
        if (sfd == -1) {
                perror("accept");
                return 1;
        }
        close(lfd);
        while (1) {
                struct pollfd pfd[1];
                pfd[0].fd = sfd;
                pfd[0].events = POLLIN;
                if (poll(pfd, 1, 4000) == 0) {
                        fprintf(stderr, "Exiting read() because no data available (4000 ms timeout).\n");
                        break;
                }
                res = read(sfd, buffer, sizeof(buffer));
                if (res > 0) total += res;
                else if (res == 0) break;
                else perror("read()");
        }
        fprintf(stderr, "read %lu bytes\n", total);
        close(sfd);
        return 0;
}
----------------------------------------

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-23 16:38:45 -07:00
Oliver Hartkopp b4942af650 net: Update entry in af_family_clock_key_strings
In the merge phase of the CAN subsystem the 
af_family_clock_key_strings[] have been added to sock.c in commit 
443aef0edd 
(lockdep: fixup sk_callback_lock annotation). This trivial patch adds 
the missing name for address family 29 (AF_CAN).

Signed-off-by: Oliver Hartkopp <oliver@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-23 14:06:04 -07:00
David S. Miller 5b3ab1dbd4 netdev: Remove warning from __netif_schedule().
It isn't helping anything and we aren't going to be able to change all
the drivers that do queue wakeups in strange situations.

Just letting a noop_qdisc get scheduled will work because when
qdisc_run() executes via net_tx_work() it will simply find no packets
pending when it makes the ->dequeue() call in qdisc_restart.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-23 14:01:29 -07:00
Krzysztof Hałasa a8817d2f6d wanmain.c doesn't need syncppp.h
Signed-off-by: Krzysztof Hałasa <khc@pm.waw.pl>
2008-07-23 23:00:36 +02:00
Krzysztof Hałasa a24e202e3f Remove dead code from wanmain.c, CONFIG_WANPIPE_MULTPPP doesn't exist
Signed-off-by: Krzysztof Hałasa <khc@pm.waw.pl>
2008-07-23 23:00:36 +02:00
Linus Torvalds 5554b35933 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx: (24 commits)
  I/OAT: I/OAT version 3.0 support
  I/OAT: tcp_dma_copybreak default value dependent on I/OAT version
  I/OAT: Add watchdog/reset functionality to ioatdma
  iop_adma: cleanup iop_chan_xor_slot_count
  iop_adma: document how to calculate the minimum descriptor pool size
  iop_adma: directly reclaim descriptors on allocation failure
  async_tx: make async_tx_test_ack a boolean routine
  async_tx: remove depend_tx from async_tx_sync_epilog
  async_tx: export async_tx_quiesce
  async_tx: fix handling of the "out of descriptor" condition in async_xor
  async_tx: ensure the xor destination buffer remains dma-mapped
  async_tx: list_for_each_entry_rcu() cleanup
  dmaengine: Driver for the Synopsys DesignWare DMA controller
  dmaengine: Add slave DMA interface
  dmaengine: add DMA_COMPL_SKIP_{SRC,DEST}_UNMAP flags to control dma unmap
  dmaengine: Add dma_client parameter to device_alloc_chan_resources
  dmatest: Simple DMA memcpy test client
  dmaengine: DMA engine driver for Marvell XOR engine
  iop-adma: fix platform driver hotplug/coldplug
  dmaengine: track the number of clients using a channel
  ...

Fixed up conflict in drivers/dca/dca-sysfs.c manually
2008-07-23 12:03:18 -07:00
Linus Torvalds c010b2f76c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (82 commits)
  ipw2200: Call netif_*_queue() interfaces properly.
  netxen: Needs to include linux/vmalloc.h
  [netdrvr] atl1d: fix !CONFIG_PM build
  r6040: rework init_one error handling
  r6040: bump release number to 0.18
  r6040: handle RX fifo full and no descriptor interrupts
  r6040: change the default waiting time
  r6040: use definitions for magic values in descriptor status
  r6040: completely rework the RX path
  r6040: call napi_disable when puting down the interface and set lp->dev accordingly.
  mv643xx_eth: fix NETPOLL build
  r6040: rework the RX buffers allocation routine
  r6040: fix scheduling while atomic in r6040_tx_timeout
  r6040: fix null pointer access and tx timeouts
  r6040: prefix all functions with r6040
  rndis_host: support WM6 devices as modems
  at91_ether: use netstats in net_device structure
  sfc: Create one RX queue and interrupt per CPU package by default
  sfc: Use a separate workqueue for resets
  sfc: I2C adapter initialisation fixes
  ...
2008-07-22 19:09:51 -07:00
Maciej Sosnowski 16a37acaaf I/OAT: tcp_dma_copybreak default value dependent on I/OAT version
I/OAT DMA performance tuning showed different optimal values of
tcp_dma_copybreak for different I/OAT versions (4096 for 1.2 and 2048
for 2.0).  This patch lets ioatdma driver set tcp_dma_copybreak value
according to these results.

[dan.j.williams@intel.com: remove some ifdefs]
Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2008-07-22 17:30:57 -07:00
Stephen Hemminger 3d0f24a74e ipv6: icmp6_dst_gc return change
Change icmp6_dst_gc to return the one value the caller cares about rather
than using call by reference.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-22 14:35:50 -07:00
Stephen Hemminger 75307c0fe7 ipv6: use kcalloc
Th fib_table_hash is an array, so use kcalloc.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-22 14:35:07 -07:00
Stephen Hemminger a76d7345a3 ipv6: use spin_trylock_bh
Now there is spin_trylock_bh, use it rather than open coding.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-22 14:34:35 -07:00
Stephen Hemminger c8a4522245 ipv6: use round_jiffies
This timer normally happens once a minute, there is no need to cause an
early wakeup for it, so align it to next second boundary to safe power.
It can't be deferred because then it could take too long on cleanup or DoS.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-22 14:34:09 -07:00
Stephen Hemminger 417f28bb34 netns: dont alloc ipv6 fib timer list
FIB timer list is a trivial size structure, avoid indirection and just
put it in existing ns.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-22 14:33:45 -07:00
Adrian Bunk 888c848ed3 ipv6: make struct ipv6_devconf static
struct ipv6_devconf can now become static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-22 14:21:58 -07:00
Adrian Bunk 4d6971e909 sctp: remove sctp_assoc_proc_exit()
Commit 20c2c1fd6c
(sctp: add sctp/remaddr table to complete RFC remote address table OID)
added an unused sctp_assoc_proc_exit() function that seems to have been 
unintentionally created when copying the assocs code.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-22 14:21:30 -07:00
Adrian Bunk abd0b198ea sctp: make sctp_outq_flush() static
sctp_outq_flush() can now become static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-22 14:20:45 -07:00
Adrian Bunk a94f779f9d pkt_sched: make qdisc_class_hash_alloc() static
This patch makes the needlessly global qdisc_class_hash_alloc() static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-22 14:20:11 -07:00
David S. Miller cf508b1211 netdev: Handle ->addr_list_lock just like ->_xmit_lock for lockdep.
The new address list lock needs to handle the same device layering
issues that the _xmit_lock one does.

This integrates work done by Patrick McHardy.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-22 14:16:42 -07:00
Dave Jones d29f749e25 net: Fix build failure with 'make mandocs'.
The function header comments have to go with the functions
they are documenting, or things go horribly wrong when we
try to process them with the docbook tools.

Warning(include/linux/netdevice.h:1006): No description found for parameter 'dev_queue'
Warning(include/linux/netdevice.h:1033): No description found for parameter 'dev_queue'
Warning(include/linux/netdevice.h:1067): No description found for parameter 'dev_queue'
Warning(include/linux/netdevice.h:1093): No description found for parameter 'dev_queue'
Warning(include/linux/netdevice.h:1474): No description found for parameter 'txq'
Error(net/core/dev.c:1674): cannot understand prototype: 'u32 simple_tx_hashrnd; '

Signed-off-by: Dave Jones <davej@redhat.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-22 14:09:06 -07:00
Greg Kroah-Hartman 16be63fd16 bluetooth: remove improper bluetooth class symlinks.
Don't create symlinks in a class to a device that is not owned by the
class.  If the bluetooth subsystem really wants to point to all of the
devices it controls, it needs to create real devices, not fake symlinks.

Cc: Maxim Krasnyansky <maxk@qualcomm.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-07-21 21:54:51 -07:00
David S. Miller b32d13102d tcp: Fix bitmask test in tcp_syn_options()
As reported by Alexey Dobriyan:

	  CHECK   net/ipv4/tcp_output.c
	net/ipv4/tcp_output.c:475:7: warning: dubious: !x & y

And sparse is damn right!

	if (unlikely(!OPTION_TS & opts->options))
		    ^^^
		size += TCPOLEN_SACKPERM_ALIGNED;

OPTION_TS is (1 << 1), so condition will never trigger.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-21 18:45:34 -07:00
Gerrit Renker 47112e25da udplite: Protection against coverage value wrap-around
This patch clamps the cscov setsockopt values to a maximum of 0xFFFF.

Setsockopt values greater than 0xffff can cause an unwanted
wrap-around.  Further, IPv6 jumbograms are not supported (RFC 3838,
3.5), so that values greater than 0xffff are not even useful.

Further changes: fixed a typo in the documentation.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-21 13:35:08 -07:00
Arjan van de Ven 6579e57b31 net: Print the module name as part of the watchdog message
As suggested by Dave:

This patch adds a function to get the driver name from a struct net_device,
and consequently uses this in the watchdog timeout handler to print as 
part of the message. 

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-21 13:31:48 -07:00
Stephen Hemminger 7943986ca1 net: use kcalloc in netdev_queue alloc
Minor nit, use size_t for allocation size and kcalloc to allocate
an array. Probably makes no actual code difference.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-21 13:28:44 -07:00
Stephen Hemminger 847499ce71 ipv6: use timer pending
This fixes the bridge reference count problem and cleanups ipv6 FIB
timer management.  Don't use expires field, because it is not a proper
way to test, instead use timer_pending().

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-21 13:21:35 -07:00
Patrick McHardy 5547cd0ae8 netfilter: nf_conntrack_sctp: fix sparse warnings
Introduced by a258860e (netfilter: ctnetlink: add full support for SCTP to ctnetlink):

net/netfilter/nf_conntrack_proto_sctp.c:483:2: warning: cast from restricted type
net/netfilter/nf_conntrack_proto_sctp.c:483:2: warning: incorrect type in argument 1 (different base types)
net/netfilter/nf_conntrack_proto_sctp.c:483:2:    expected unsigned int [unsigned] [usertype] x
net/netfilter/nf_conntrack_proto_sctp.c:483:2:    got restricted unsigned int const <noident>
net/netfilter/nf_conntrack_proto_sctp.c:483:2: warning: cast from restricted type
net/netfilter/nf_conntrack_proto_sctp.c:483:2: warning: cast from restricted type
net/netfilter/nf_conntrack_proto_sctp.c:483:2: warning: cast from restricted type
net/netfilter/nf_conntrack_proto_sctp.c:483:2: warning: cast from restricted type
net/netfilter/nf_conntrack_proto_sctp.c:487:2: warning: cast from restricted type
net/netfilter/nf_conntrack_proto_sctp.c:487:2: warning: incorrect type in argument 1 (different base types)
net/netfilter/nf_conntrack_proto_sctp.c:487:2:    expected unsigned int [unsigned] [usertype] x
net/netfilter/nf_conntrack_proto_sctp.c:487:2:    got restricted unsigned int const <noident>
net/netfilter/nf_conntrack_proto_sctp.c:487:2: warning: cast from restricted type
net/netfilter/nf_conntrack_proto_sctp.c:487:2: warning: cast from restricted type
net/netfilter/nf_conntrack_proto_sctp.c:487:2: warning: cast from restricted type
net/netfilter/nf_conntrack_proto_sctp.c:487:2: warning: cast from restricted type
net/netfilter/nf_conntrack_proto_sctp.c:532:42: warning: incorrect type in assignment (different base types)
net/netfilter/nf_conntrack_proto_sctp.c:532:42:    expected restricted unsigned int <noident>
net/netfilter/nf_conntrack_proto_sctp.c:532:42:    got unsigned int
net/netfilter/nf_conntrack_proto_sctp.c:534:39: warning: incorrect type in assignment (different base types)
net/netfilter/nf_conntrack_proto_sctp.c:534:39:    expected restricted unsigned int <noident>
net/netfilter/nf_conntrack_proto_sctp.c:534:39:    got unsigned int

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-21 10:11:02 -07:00
Herbert Xu c71529e42c netfilter: nf_nat_sip: c= is optional for session
According to RFC2327, the connection information is optional
in the session description since it can be specified in the
media description instead.

My provider does exactly that and does not provide any connection
information in the session description.  As a result the new
kernel drops all invite responses.

This patch makes it optional as documented.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-21 10:11:02 -07:00
Jan Engelhardt db1a75bdcc netfilter: xt_TCPMSS: collapse tcpmss_reverse_mtu{4,6} into one function
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-21 10:11:01 -07:00
Eric Leblond 72961ecf84 netfilter: nfnetlink_log: send complete hardware header
This patch adds some fields to NFLOG to be able to send the complete
hardware header with all necessary informations.
It sends to userspace:
 * the type of hardware link
 * the lenght of hardware header
 * the hardware header

Signed-off-by: Eric Leblond <eric@inl.fr>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-21 10:11:00 -07:00
David Howells 280763c053 netfilter: xt_time: fix time's time_mt()'s use of do_div()
Fix netfilter xt_time's time_mt()'s use of do_div() on an s64 by using
div_s64() instead.

This was introduced by patch ee4411a1b1
("[NETFILTER]: x_tables: add xt_time match").

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-21 10:10:59 -07:00
Krzysztof Piotr Oledzki 584015727a netfilter: accounting rework: ct_extend + 64bit counters (v4)
Initially netfilter has had 64bit counters for conntrack-based accounting, but
it was changed in 2.6.14 to save memory. Unfortunately in-kernel 64bit counters are
still required, for example for "connbytes" extension. However, 64bit counters
waste a lot of memory and it was not possible to enable/disable it runtime.

This patch:
 - reimplements accounting with respect to the extension infrastructure,
 - makes one global version of seq_print_acct() instead of two seq_print_counters(),
 - makes it possible to enable it at boot time (for CONFIG_SYSCTL/CONFIG_SYSFS=n),
 - makes it possible to enable/disable it at runtime by sysctl or sysfs,
 - extends counters from 32bit to 64bit,
 - renames ip_conntrack_counter -> nf_conn_counter,
 - enables accounting code unconditionally (no longer depends on CONFIG_NF_CT_ACCT),
 - set initial accounting enable state based on CONFIG_NF_CT_ACCT
 - removes buggy IPCT_COUNTER_FILLING event handling.

If accounting is enabled newly created connections get additional acct extend.
Old connections are not changed as it is not possible to add a ct_extend area
to confirmed conntrack. Accounting is performed for all connections with
acct extend regardless of a current state of "net.netfilter.nf_conntrack_acct".

Signed-off-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-21 10:10:58 -07:00
Changli Gao 0dbff689c2 netfilter: nf_nat_core: eliminate useless find_appropriate_src for IP_NAT_RANGE_PROTO_RANDOM
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-21 10:10:57 -07:00
David S. Miller d3678b463d Revert "pkt_sched: Make default qdisc nonshared-multiqueue safe."
This reverts commit a0c80b80e0.

After discussions with Jamal and Herbert on netdev, we should
provide at least minimal prioritization at the qdisc level
even in multiqueue situations.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-21 10:10:50 -07:00
Linus Torvalds 867d79fb9a net: In __netif_schedule() use WARN_ON instead of BUG_ON
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-21 10:10:49 -07:00
David S. Miller b6b2fed1f4 net: Improve simple_tx_hash().
Based upon feedback from Eric Dumazet and Andi Kleen.

Cure several deficiencies in simple_tx_hash() by using
jhash + reciprocol multiply.

1) Eliminates expensive modulus operation.

2) Makes hash less attackable by using random seed.

3) Eliminates endianness hash distribution issues.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-21 10:10:48 -07:00
Daniel Lezcano c3ee84163e pkt_sched: Remove unused variable skb in dev_deactivate_queue function.
Removed unused variable 'skb' in the dev_deactivate_queue function

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-21 09:18:07 -07:00
Ingo Molnar eb6a12c242 Merge branch 'linus' into cpus4096-for-linus
Conflicts:

	net/sunrpc/svc.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-21 17:19:50 +02:00
Linus Torvalds 14b395e35d Merge branch 'for-2.6.27' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.27' of git://linux-nfs.org/~bfields/linux: (51 commits)
  nfsd: nfs4xdr.c do-while is not a compound statement
  nfsd: Use C99 initializers in fs/nfsd/nfs4xdr.c
  lockd: Pass "struct sockaddr *" to new failover-by-IP function
  lockd: get host reference in nlmsvc_create_block() instead of callers
  lockd: minor svclock.c style fixes
  lockd: eliminate duplicate nlmsvc_lookup_host call from nlmsvc_lock
  lockd: eliminate duplicate nlmsvc_lookup_host call from nlmsvc_testlock
  lockd: nlm_release_host() checks for NULL, caller needn't
  file lock: reorder struct file_lock to save space on 64 bit builds
  nfsd: take file and mnt write in nfs4_upgrade_open
  nfsd: document open share bit tracking
  nfsd: tabulate nfs4 xdr encoding functions
  nfsd: dprint operation names
  svcrdma: Change WR context get/put to use the kmem cache
  svcrdma: Create a kmem cache for the WR contexts
  svcrdma: Add flush_scheduled_work to module exit function
  svcrdma: Limit ORD based on client's advertised IRD
  svcrdma: Remove unused wait q from svcrdma_xprt structure
  svcrdma: Remove unneeded spin locks from __svc_rdma_free
  svcrdma: Add dma map count and WARN_ON
  ...
2008-07-20 21:21:46 -07:00
David Miller 702beb87d6 ipv6: Fix warning in addrconf code.
Reported by Linus.

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-20 21:18:26 -07:00
Linus Torvalds d1671a9c15 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
  pkt_sched: Fix build with NET_SCHED disabled.
2008-07-20 21:17:20 -07:00
David S. Miller 3a682fbd73 pkt_sched: Fix build with NET_SCHED disabled.
The stab bits can't be referenced uniless the full
packet scheduler layer is enabled.

Reported by Stephen Rothwell.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-20 18:13:01 -07:00
Linus Torvalds db6d8c7a40 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (1232 commits)
  iucv: Fix bad merging.
  net_sched: Add size table for qdiscs
  net_sched: Add accessor function for packet length for qdiscs
  net_sched: Add qdisc_enqueue wrapper
  highmem: Export totalhigh_pages.
  ipv6 mcast: Omit redundant address family checks in ip6_mc_source().
  net: Use standard structures for generic socket address structures.
  ipv6 netns: Make several "global" sysctl variables namespace aware.
  netns: Use net_eq() to compare net-namespaces for optimization.
  ipv6: remove unused macros from net/ipv6.h
  ipv6: remove unused parameter from ip6_ra_control
  tcp: fix kernel panic with listening_get_next
  tcp: Remove redundant checks when setting eff_sacks
  tcp: options clean up
  tcp: Fix MD5 signatures for non-linear skbs
  sctp: Update sctp global memory limit allocations.
  sctp: remove unnecessary byteshifting, calculate directly in big-endian
  sctp: Allow only 1 listening socket with SO_REUSEADDR
  sctp: Do not leak memory on multiple listen() calls
  sctp: Support ipv6only AF_INET6 sockets.
  ...
2008-07-20 17:43:29 -07:00
Alan Cox a352def21a tty: Ldisc revamp
Move the line disciplines towards a conventional ->ops arrangement.  For
the moment the actual 'tty_ldisc' struct in the tty is kept as part of
the tty struct but this can then be changed if it turns out that when it
all settles down we want to refcount ldiscs separately to the tty.

Pull the ldisc code out of /proc and put it with our ldisc code.

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-20 17:12:34 -07:00
David S. Miller fb65a7c091 iucv: Fix bad merging.
Noticed by Stephen Rothwell.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-20 10:18:44 -07:00
Jussi Kivilinna 175f9c1bba net_sched: Add size table for qdiscs
Add size table functions for qdiscs and calculate packet size in
qdisc_enqueue().

Based on patch by Patrick McHardy
 http://marc.info/?l=linux-netdev&m=115201979221729&w=2

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-20 00:08:47 -07:00
Jussi Kivilinna 0abf77e55a net_sched: Add accessor function for packet length for qdiscs
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-20 00:08:27 -07:00
Jussi Kivilinna 5f86173bdf net_sched: Add qdisc_enqueue wrapper
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-20 00:08:04 -07:00
YOSHIFUJI Hideaki a6ffb404dc ipv6 mcast: Omit redundant address family checks in ip6_mc_source().
The caller has alredy checked for them.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-19 22:36:07 -07:00
YOSHIFUJI Hideaki 230b183921 net: Use standard structures for generic socket address structures.
Use sockaddr_storage{} for generic socket address storage
and ensures proper alignment.
Use sockaddr{} for pointers to omit several casts.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-19 22:35:47 -07:00
YOSHIFUJI Hideaki 53b7997fd5 ipv6 netns: Make several "global" sysctl variables namespace aware.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-19 22:35:03 -07:00
YOSHIFUJI Hideaki 721499e893 netns: Use net_eq() to compare net-namespaces for optimization.
Without CONFIG_NET_NS, namespace is always &init_net.
Compiler will be able to omit namespace comparisons with this patch.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-19 22:34:43 -07:00
David S. Miller 407d819cf0 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/holtmann/bluetooth-2.6 2008-07-19 00:30:39 -07:00
Denis V. Lunev 725a8ff04a ipv6: remove unused parameter from ip6_ra_control
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-19 00:28:58 -07:00
Daniel Lezcano bdccc4ca13 tcp: fix kernel panic with listening_get_next
# BUG: unable to handle kernel NULL pointer dereference at
0000000000000038
IP: [<ffffffff821ed01e>] listening_get_next+0x50/0x1b3
PGD 11e4b9067 PUD 11d16c067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
CPU 3
Modules linked in: bridge ipv6 button battery ac loop dm_mod tg3 ext3
jbd edd fan thermal processor thermal_sys hwmon sg sata_svw libata dock
serverworks sd_mod scsi_mod ide_disk ide_core [last unloaded: freq_table]
Pid: 3368, comm: slpd Not tainted 2.6.26-rc2-mm1-lxc4 #1
RIP: 0010:[<ffffffff821ed01e>] [<ffffffff821ed01e>]
listening_get_next+0x50/0x1b3
RSP: 0018:ffff81011e1fbe18 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8100be0ad3c0 RCX: ffff8100619f50c0
RDX: ffffffff82475be0 RSI: ffff81011d9ae6c0 RDI: ffff8100be0ad508
RBP: ffff81011f4f1240 R08: 00000000ffffffff R09: ffff8101185b6780
R10: 000000000000002d R11: ffffffff820fdbfa R12: ffff8100be0ad3c8
R13: ffff8100be0ad6a0 R14: ffff8100be0ad3c0 R15: ffffffff825b8ce0
FS: 00007f6a0ebd16d0(0000) GS:ffff81011f424540(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000038 CR3: 000000011dc20000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process slpd (pid: 3368, threadinfo ffff81011e1fa000, task
ffff81011f4b8660)
Stack: 00000000000002ee ffff81011f5a57c0 ffff81011f4f1240
ffff81011e1fbe90
0000000000001000 0000000000000000 00007fff16bf2590 ffffffff821ed9c8
ffff81011f5a57c0 ffff81011d9ae6c0 000000000000041a ffffffff820b0abd
Call Trace:
[<ffffffff821ed9c8>] ? tcp_seq_next+0x34/0x7e
[<ffffffff820b0abd>] ? seq_read+0x1aa/0x29d
[<ffffffff820d21b4>] ? proc_reg_read+0x73/0x8e
[<ffffffff8209769c>] ? vfs_read+0xaa/0x152
[<ffffffff82097a7d>] ? sys_read+0x45/0x6e
[<ffffffff8200bd2b>] ? system_call_after_swapgs+0x7b/0x80


Code: 31 a9 25 00 e9 b5 00 00 00 ff 45 20 83 7d 0c 01 75 79 4c 8b 75 10
48 8b 0e eb 1d 48 8b 51 20 0f b7 45 08 39 02 75 0e 48 8b 41 28 <4c> 39
78 38 0f 84 93 00 00 00 48 8b 09 48 85 c9 75 de 8b 55 1c
RIP [<ffffffff821ed01e>] listening_get_next+0x50/0x1b3
RSP <ffff81011e1fbe18>
CR2: 0000000000000038

This kernel panic appears with CONFIG_NET_NS=y.

How to reproduce ?

    On the buggy host (host A)
       * ip addr add 1.2.3.4/24 dev eth0

    On a remote host (host B)
       * ip addr add 1.2.3.5/24 dev eth0
       * iptables -A INPUT -p tcp -s 1.2.3.4 -j DROP
       * ssh 1.2.3.4

    On host A:
       * netstat -ta or cat /proc/net/tcp

This bug happens when reading /proc/net/tcp[6] when there is a req_sock
at the SYN_RECV state.

When a SYN is received the minisock is created and the sk field is set to
NULL. In the listening_get_next function, we try to look at the field 
req->sk->sk_net.

When looking at how to fix this bug, I noticed that is useless to do
the check for the minisock belonging to the namespace. A minisock belongs
to a listen point and this one is per namespace, so when browsing the
minisock they are always per namespace.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-19 00:15:13 -07:00
Adam Langley 4389dded77 tcp: Remove redundant checks when setting eff_sacks
Remove redundant checks when setting eff_sacks and make the number of SACKs a
compile time constant. Now that the options code knows how many SACK blocks can
fit in the header, we don't need to have the SACK code guessing at it.

Signed-off-by: Adam Langley <agl@imperialviolet.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-19 00:07:02 -07:00
Adam Langley 33ad798c92 tcp: options clean up
This should fix the following bugs:
  * Connections with MD5 signatures produce invalid packets whenever SACK
    options are included
  * MD5 signatures are counted twice in the MSS calculations

Behaviour changes:
  * A SYN with MD5 + SACK + TS elicits a SYNACK with MD5 + SACK

    This is because we can't fit any SACK blocks in a packet with MD5 + TS
    options. There was discussion about disabling SACK rather than TS in
    order to fit in better with old, buggy kernels, but that was deemed to
    be unnecessary.

  * SYNs with MD5 don't include a TS option

    See above.

Additionally, it removes a bunch of duplicated logic for calculating options,
which should help avoid these sort of issues in the future.

Signed-off-by: Adam Langley <agl@imperialviolet.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-19 00:04:31 -07:00
Adam Langley 49a72dfb88 tcp: Fix MD5 signatures for non-linear skbs
Currently, the MD5 code assumes that the SKBs are linear and, in the case
that they aren't, happily goes off and hashes off the end of the SKB and
into random memory.

Reported by Stephen Hemminger in [1]. Advice thanks to Stephen and Evgeniy
Polyakov. Also includes a couple of missed route_caps from Stephen's patch
in [2].

[1] http://marc.info/?l=linux-netdev&m=121445989106145&w=2
[2] http://marc.info/?l=linux-netdev&m=121459157816964&w=2

Signed-off-by: Adam Langley <agl@imperialviolet.org>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-19 00:01:42 -07:00
Vlad Yasevich 845525a642 sctp: Update sctp global memory limit allocations.
Update sctp global memory limit allocations to be the same as TCP.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18 23:08:21 -07:00
Harvey Harrison 336d3262df sctp: remove unnecessary byteshifting, calculate directly in big-endian
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18 23:07:09 -07:00
Vlad Yasevich 4e54064e0a sctp: Allow only 1 listening socket with SO_REUSEADDR
When multiple socket bind to the same port with SO_REUSEADDR,
only 1 can be listining.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18 23:06:32 -07:00
Vlad Yasevich 23b29ed80b sctp: Do not leak memory on multiple listen() calls
SCTP permits multiple listen call and on subsequent calls
we leak he memory allocated for the crypto transforms.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18 23:06:07 -07:00
Vlad Yasevich 7dab83de50 sctp: Support ipv6only AF_INET6 sockets.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18 23:05:40 -07:00
Florian Westphal 6d0ccbac68 sctp: Prevent uninitialized memory access
valgrind reports uninizialized memory accesses when running
sctp inside the network simulation cradle simulator:

 Conditional jump or move depends on uninitialised value(s)
    at 0x570E34A: sctp_assoc_sync_pmtu (associola.c:1324)
    by 0x57427DA: sctp_packet_transmit (output.c:403)
    by 0x5710EFF: sctp_outq_flush (outqueue.c:824)
    by 0x5710B88: sctp_outq_uncork (outqueue.c:701)
    by 0x5745262: sctp_cmd_interpreter (sm_sideeffect.c:1548)
    by 0x57444B7: sctp_side_effects (sm_sideeffect.c:976)
    by 0x5744460: sctp_do_sm (sm_sideeffect.c:945)
    by 0x572157D: sctp_primitive_ASSOCIATE (primitive.c:94)
    by 0x5725C04: __sctp_connect (socket.c:1094)
    by 0x57297DC: sctp_connect (socket.c:3297)

 Conditional jump or move depends on uninitialised value(s)
    at 0x575D3A5: mod_timer (timer.c:630)
    by 0x5752B78: sctp_cmd_hb_timers_start (sm_sideeffect.c:555)
    by 0x5754133: sctp_cmd_interpreter (sm_sideeffect.c:1448)
    by 0x5753607: sctp_side_effects (sm_sideeffect.c:976)
    by 0x57535B0: sctp_do_sm (sm_sideeffect.c:945)
    by 0x571E9AE: sctp_endpoint_bh_rcv (endpointola.c:474)
    by 0x573347F: sctp_inq_push (inqueue.c:104)
    by 0x572EF93: sctp_rcv (input.c:256)
    by 0x5689623: ip_local_deliver_finish (ip_input.c:230)
    by 0x5689759: ip_local_deliver (ip_input.c:268)
    by 0x5689CAC: ip_rcv_finish (dst.h:246)

#1 is due to "if (t->pmtu_pending)".
8a4794914f "[SCTP] Flag a pmtu change request"
suggests it should be initialized to 0.

#2 is the heartbeat timer 'expires' value, which is uninizialised, but
test by mod_timer().
T3_rtx_timer seems to be affected by the same problem, so initialize it, too.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18 23:04:39 -07:00
Florian Westphal c4e85f82ed sctp: Don't abort initialization when CONFIG_PROC_FS=n
This puts CONFIG_PROC_FS defines around the proc init/exit functions
and also avoids compiling proc.c if procfs is not supported.
Also make SCTP_DBG_OBJCNT depend on procfs.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18 23:03:44 -07:00
Stephen Hemminger c1e20f7c8b tcp: RTT metrics scaling
Some of the metrics (RTT, RTTVAR and RTAX_RTO_MIN) are stored in
kernel units (jiffies) and this leaks out through the netlink API to
user space where the units for jiffies are unknown.

This patches changes the kernel to convert to/from milliseconds. This
changes the ABI, but milliseconds seemed like the most natural unit
for these parameters.  Values available via syscall in
/proc/net/rt_cache and netlink will be in milliseconds.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18 23:02:15 -07:00
David S. Miller 30ee42be00 pkt_sched: Fix noqueue_qdisc initialization.
Like noop_qdisc, it needs a dummy backpointer and
explicit qdisc->q.lock initialization.

Based upon a report by Stephen Hemminger.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18 23:00:11 -07:00
David S. Miller 3072367300 pkt_sched: Manage qdisc list inside of root qdisc.
Idea is from Patrick McHardy.

Instead of managing the list of qdiscs on the device level, manage it
in the root qdisc of a netdev_queue.  This solves all kinds of
visibility issues during qdisc destruction.

The way to iterate over all qdiscs of a netdev_queue is to visit
the netdev_queue->qdisc, and then traverse it's list.

The only special case is to ignore builting qdiscs at the root when
dumping or doing a qdisc_lookup().  That was not needed previously
because builtin qdiscs were not added to the device's qdisc_list.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18 22:50:15 -07:00
David S. Miller 72b25a913e pkt_sched: Get rid of u32_list.
The u32_list is just an indirect way of maintaining a reference
to a U32 node on a per-qdisc basis.

Just add an explicit node pointer for u32 to struct Qdisc an do
away with this global list.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18 20:54:17 -07:00
Patrick McHardy 8913336a7e packet: add PACKET_RESERVE sockopt
Add new sockopt to reserve some headroom in the mmaped ring frames in
front of the packet payload. This can be used f.i. when the VLAN header
needs to be (re)constructed to avoid moving the entire payload.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18 18:05:19 -07:00
Mike Travis 65c0118453 cpumask: Replace cpumask_of_cpu with cpumask_of_cpu_ptr
* This patch replaces the dangerous lvalue version of cpumask_of_cpu
    with new cpumask_of_cpu_ptr macros.  These are patterned after the
    node_to_cpumask_ptr macros.

    In general terms, if there is a cpumask_of_cpu_map[] then a pointer to
    the cpumask_of_cpu_map[cpu] entry is used.  The cpumask_of_cpu_map
    is provided when there is a large NR_CPUS count, reducing
    greatly the amount of code generated and stack space used for
    cpumask_of_cpu().  The pointer to the cpumask_t value is needed for
    calling set_cpus_allowed_ptr() to reduce the amount of stack space
    needed to pass the cpumask_t value.

    If there isn't a cpumask_of_cpu_map[], then a temporary variable is
    declared and filled in with value from cpumask_of_cpu(cpu) as well as
    a pointer variable pointing to this temporary variable.  Afterwards,
    the pointer is used to reference the cpumask value.  The compiler
    will optimize out the extra dereference through the pointer as well
    as the stack space used for the pointer, resulting in identical code.

    A good example of the orthogonal usages is in net/sunrpc/svc.c:

	case SVC_POOL_PERCPU:
	{
		unsigned int cpu = m->pool_to[pidx];
		cpumask_of_cpu_ptr(cpumask, cpu);

		*oldmask = current->cpus_allowed;
		set_cpus_allowed_ptr(current, cpumask);
		return 1;
	}
	case SVC_POOL_PERNODE:
	{
		unsigned int node = m->pool_to[pidx];
		node_to_cpumask_ptr(nodecpumask, node);

		*oldmask = current->cpus_allowed;
		set_cpus_allowed_ptr(current, nodecpumask);
		return 1;
	}

Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-18 22:02:57 +02:00
Ingo Molnar bb2c018b09 Merge branch 'linus' into cpus4096
Conflicts:

	drivers/acpi/processor_throttling.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-18 22:00:54 +02:00
Pavel Emelyanov b6fcbdb4f2 proc: consolidate per-net single-release callers
They are symmetrical to single_open ones :)

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18 04:07:44 -07:00
Pavel Emelyanov de05c557b2 proc: consolidate per-net single_open callers
There are already 7 of them - time to kill some duplicate code.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18 04:07:21 -07:00
Pavel Emelyanov 60bdde9580 proc: clean the ip_misc_proc_init and ip_proc_init_net error paths
After all this stuff is moved outside, this function can look better.

Besides, I tuned the error path in ip_proc_init_net to make it have
only 2 exit points, not 3.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18 04:06:50 -07:00
Pavel Emelyanov 8e3461d01b proc: show per-net ip_devconf.forwarding in /proc/net/snmp
This one has become per-net long ago, but the appropriate file
is per-net only now.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18 04:06:26 -07:00
Pavel Emelyanov 229bf0cbaa proc: create /proc/net/snmp file in each net
All the statistics shown in this file have been made per-net already.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18 04:06:04 -07:00
Pavel Emelyanov 7b7a9dfdf6 proc: create /proc/net/netstat file in each net
Now all the shown in it statistics is netnsizated, time to
show it in appropriate net.

The appropriate net init/exit ops already exist - they make
the sockstat file per net - so just extend them.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18 04:05:17 -07:00
Pavel Emelyanov d89cbbb1e6 ipv4: clean the init_ipv4_mibs error paths
After moving all the stuff outside this function it looks
a bit ugly - make it look better.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18 04:04:51 -07:00
Pavel Emelyanov 923c6586b0 mib: put icmpmsg statistics on struct net
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18 04:04:22 -07:00
Pavel Emelyanov b60538a0d7 mib: put icmp statistics on struct net
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18 04:04:02 -07:00
Pavel Emelyanov 386019d351 mib: put udplite statistics on struct net
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18 04:03:45 -07:00
Pavel Emelyanov 2f275f91a4 mib: put udp statistics on struct net
Similar to... ouch, I repeat myself.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18 04:03:27 -07:00
Pavel Emelyanov 61a7e26028 mib: put net statistics on struct net
Similar to ip and tcp ones :)

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18 04:03:08 -07:00
Pavel Emelyanov a20f5799ca mib: put ip statistics on struct net
Similar to tcp one.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18 04:02:42 -07:00
Pavel Emelyanov 57ef42d59d mib: put tcp statistics on struct net
Proc temporary uses stats from init_net.

BTW, TCP_XXX_STATS are beautiful (w/o do { } while (0) facing) again :)

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18 04:02:08 -07:00
Pavel Emelyanov 9b4661bd6e ipv4: add pernet mib operations
These ones are currently empty, but stuff from init_ipv4_mibs will
sequentially migrate there.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18 04:01:44 -07:00
David S. Miller 49997d7515 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6
Conflicts:

	Documentation/powerpc/booting-without-of.txt
	drivers/atm/Makefile
	drivers/net/fs_enet/fs_enet-main.c
	drivers/pci/pci-acpi.c
	net/8021q/vlan.c
	net/iucv/iucv.c
2008-07-18 02:39:39 -07:00
David S. Miller a0c80b80e0 pkt_sched: Make default qdisc nonshared-multiqueue safe.
Instead of 'pfifo_fast' we have just plain 'fifo_fast'.
No priority queues, just a straight FIFO.

This is necessary in order to legally have a seperate
qdisc per queue in multi-TX-queue setups, and thus get
full parallelization.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-17 19:21:33 -07:00
David S. Miller 99194cff39 pkt_sched: Add multiqueue handling to qdisc_graft().
Move the destruction of the old queue into qdisc_graft().

When operating on a root qdisc (ie. "parent == NULL"), apply
the operation to all queues.  The caller has grabbed a single
implicit reference for this graft, therefore when we apply the
change to more than one queue we must grab additional qdisc
references.

Otherwise, we are operating on a class of a specific parent qdisc, and
therefore no multiqueue handling is necessary.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-17 19:21:30 -07:00
David S. Miller 8387400092 pkt_sched: Kill netdev_queue lock.
We can simply use the qdisc->q.lock for all of the
qdisc tree synchronization.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-17 19:21:30 -07:00
David S. Miller c7e4f3bbb4 pkt_sched: Kill qdisc_lock_tree and qdisc_unlock_tree.
No longer used.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-17 19:21:29 -07:00
David S. Miller 53049978df pkt_sched: Make qdisc grafting locking more specific.
Lock the root of the qdisc being operated upon.

All explicit references to qdisc_tree_lock() are now gone.
The only remaining uses are via the sch_tree_{lock,unlock}()
and tcf_tree_{lock,unlock}() macros.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-17 19:21:27 -07:00
David S. Miller ead81cc5fc netdevice: Move qdisc_list back into net_device proper.
And give it it's own lock.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-17 19:21:26 -07:00
David S. Miller 15b458fa65 pkt_sched: Kill qdisc_lock_tree usage in cls_route.c
It just wants the qdisc tree to be synchronized, so grabbing
qdisc_root_lock() is sufficient.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-17 19:21:25 -07:00
David S. Miller 55dbc640c3 pkt_sched: Remove qdisc_lock_tree usage in cls_api.c
It just wants the qdisc tree for the filter to be synchronized.
So just BH lock qdisc_root_lock(q) instead.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-17 19:21:24 -07:00
David S. Miller 17715e62a5 pkt_sched: Use per-queue locking in shutdown_scheduler_queue.
This eliminates another qdisc_lock_tree user.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-17 19:21:23 -07:00
David S. Miller 8a34c5dc3a pkt_sched: Perform bulk of qdisc destruction in RCU.
This allows less strict control of access to the qdisc attached to a
netdev_queue.  It is even allowed to enqueue into a qdisc which is
in the process of being destroyed.  The RCU handler will toss out
those packets.

We will need this to handle sharing of a qdisc amongst multiple
TX queues.  In such a setup the lock has to be shared, so will
be inside of the qdisc itself.  At which point the netdev_queue
lock cannot be used to hard synchronize access to the ->qdisc
pointer.

One operation we have to keep inside of qdisc_destroy() is the list
deletion.  It is the only piece of state visible after the RCU quiesce
period, so we have to undo it early and under the appropriate locking.

The operations in the RCU handler do not need any looking because the
qdisc tree is no longer visible to anything at that point.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-17 19:21:22 -07:00
David S. Miller 16361127eb pkt_sched: dev_init_scheduler() does not need to lock qdisc tree.
We are registering the device, there is no way anyone can get
at this object's qdiscs yet in any meaningful way.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-17 19:21:21 -07:00
David S. Miller 37437bb2e1 pkt_sched: Schedule qdiscs instead of netdev_queue.
When we have shared qdiscs, packets come out of the qdiscs
for multiple transmit queues.

Therefore it doesn't make any sense to schedule the transmit
queue when logically we cannot know ahead of time the TX
queue of the SKB that the qdisc->dequeue() will give us.

Just for sanity I added a BUG check to make sure we never
get into a state where the noop_qdisc is scheduled.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-17 19:21:20 -07:00
David S. Miller 7698b4fcab pkt_sched: Add and use qdisc_root() and qdisc_root_lock().
When code wants to lock the qdisc tree state, the logic
operation it's doing is locking the top-level qdisc that
sits of the root of the netdev_queue.

Add qdisc_root_lock() to represent this and convert the
easiest cases.

In order for this to work out in all cases, we have to
hook up the noop_qdisc to a dummy netdev_queue.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-17 19:21:19 -07:00
David S. Miller e2627c8c22 pkt_sched: Make QDISC_RUNNING a qdisc state.
Currently it is associated with a netdev_queue, but when we have
qdisc sharing that no longer makes any sense.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-17 19:21:18 -07:00
David S. Miller d3b753db7c pkt_sched: Move gso_skb into Qdisc.
We liberate any dangling gso_skb during qdisc destruction.

It really only matters for the root qdisc.  But when qdiscs
can be shared by multiple netdev_queue objects, we can't
have the gso_skb in the netdev_queue any more.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-17 19:21:18 -07:00
David S. Miller 8f0f2223cc net: Implement simple sw TX hashing.
It just xor hashes over IPv4/IPv6 addresses and ports of transport.

The only assumption it makes is that skb_network_header() is set
correctly.

With bug fixes from Eric Dumazet.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-17 19:21:13 -07:00
David S. Miller 51cb6db0f5 mac80211: Reimplement WME using ->select_queue().
The only behavior change is that we do not drop packets under any
circumstances.  If that is absolutely needed, we could easily add it
back.

With cleanups and help from Johannes Berg.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-17 19:21:12 -07:00
David S. Miller eae792b722 netdev: Add netdev->select_queue() method.
Devices or device layers can set this to control the queue selection
performed by dev_pick_tx().

This function runs under RCU protection, which allows overriding
functions to have some way of synchronizing with things like dynamic
->real_num_tx_queues adjustments.

This makes the spinlock prefetch in dev_queue_xmit() a little bit
less effective, but that's the price right now for correctness.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-17 19:21:10 -07:00
David S. Miller fd2ea0a79f net: Use queue aware tests throughout.
This effectively "flips the switch" by making the core networking
and multiqueue-aware drivers use the new TX multiqueue structures.

Non-multiqueue drivers need no changes.  The interfaces they use such
as netif_stop_queue() degenerate into an operation on TX queue zero.
So everything "just works" for them.

Code that really wants to do "X" to all TX queues now invokes a
routine that does so, such as netif_tx_wake_all_queues(),
netif_tx_stop_all_queues(), etc.

pktgen and netpoll required a little bit more surgery than the others.

In particular the pktgen changes, whilst functional, could be largely
improved.  The initial check in pktgen_xmit() will sometimes check the
wrong queue, which is mostly harmless.  The thing to do is probably to
invoke fill_packet() earlier.

The bulk of the netpoll changes is to make the code operate solely on
the TX queue indicated by by the SKB queue mapping.

Setting of the SKB queue mapping is entirely confined inside of
net/core/dev.c:dev_pick_tx().  If we end up needing any kind of
special semantics (drops, for example) it will be implemented here.

Finally, we now have a "real_num_tx_queues" which is where the driver
indicates how many TX queues are actually active.

With IGB changes from Jeff Kirsher.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-17 19:21:07 -07:00
David S. Miller 24344d2600 mac80211: Temporarily mark QoS support BROKEN.
We will undo this after a few changsets.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-17 19:21:05 -07:00
David S. Miller 1d8ae3fdeb pkt_sched: Remove RR scheduler.
This actually fixes a bug added by the RR scheduler changes.  The
->bands and ->prio2band parameters were being set outside of the
sch_tree_lock() and thus could result in strange behavior and
inconsistencies.

It might be possible, in the new design (where there will be one qdisc
per device TX queue) to allow similar functionality via a TX hash
algorithm for RR but I really see no reason to export this aspect of
how these multiqueue cards actually implement the scheduling of the
the individual DMA TX rings and the single physical MAC/PHY port.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-17 19:21:04 -07:00
David S. Miller 09e83b5d7d netdev: Kill NETIF_F_MULTI_QUEUE.
There is no need for a feature bit for something that
can be tested by simply checking the TX queue count.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-17 19:21:03 -07:00
David S. Miller e8a0464cc9 netdev: Allocate multiple queues for TX.
alloc_netdev_mq() now allocates an array of netdev_queue
structures for TX, based upon the queue_count argument.

Furthermore, all accesses to the TX queues are now vectored
through the netdev_get_tx_queue() and netdev_for_each_tx_queue()
interfaces.  This makes it easy to grep the tree for all
things that want to get to a TX queue of a net device.

Problem spots which are not really multiqueue aware yet, and
only work with one queue, can easily be spotted by grepping
for all netdev_get_tx_queue() calls that pass in a zero index.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-17 19:21:00 -07:00
Patrick McHardy 51ce7ec921 garp: retry sending JoinIn messages after allocation failures
Increase reliability by retrying to send JoinIn messages after memory
allocation failures on each TRANSMIT_PDU event until it succeeds.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16 20:51:47 -07:00
Neil Horman 9a6d276e85 core: add stat to track unresolved discards in neighbor cache
in __neigh_event_send, if we have a neighbour entry which is in
NUD_INCOMPLETE state, we enqueue any outbound frames to that neighbour
to the neighbours arp_queue, which is default capped to a length of 3
skbs.  If that queue exceeds its set length, it will drop an skb on
the queue to enqueue the newly arrived skb.  This results in a drop
for which we have no statistics incremented.  This patch adds an
unresolved_discards stat to /proc/net/stat/ndisc_cache to track these
lost frames.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16 20:50:49 -07:00
Pavel Emelyanov ed88098e25 mib: add net to NET_ADD_STATS_USER
Done with NET_XXX_STATS macros :)

To be continued...

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16 20:32:45 -07:00
Pavel Emelyanov f2bf415cfe mib: add net to NET_ADD_STATS_BH
This one is tricky. 

The thing is that this macro is only used when killing tw buckets, 
but since this killer is promiscuous wrt to which net each particular
tw belongs to, I have to use it only when NET_NS is off. When the net
namespaces are on, I use the INET_INC_STATS_BH for each bucket.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16 20:32:25 -07:00
Pavel Emelyanov 6f67c817fc mib: add net to NET_INC_STATS_USER
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16 20:31:39 -07:00
Pavel Emelyanov de0744af1f mib: add net to NET_INC_STATS_BH
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16 20:31:16 -07:00
Pavel Emelyanov 4e6734447d mib: add net to NET_INC_STATS
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16 20:30:14 -07:00
Pavel Emelyanov 1ed834655a tcp: replace tcp_sock argument with sock in some places
These places have a tcp_sock, but we'd prefer the sock itself to
get net from it. Fortunately, tcp_sk macro is just a type cast, so
this replace is really cheap.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16 20:29:51 -07:00
Pavel Emelyanov ca12a1a443 inet: prepare net on the stack for NET accounting macros
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16 20:28:42 -07:00
Pavel Emelyanov 5c52ba170f sock: add net to prot->enter_memory_pressure callback
The tcp_enter_memory_pressure calls NET_INC_STATS, but doesn't
have where to get the net from.

I decided to add a sk argument, not the net itself, only to factor
all the required sock_net(sk) calls inside the enter_memory_pressure 
callback itself.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16 20:28:10 -07:00
Pavel Emelyanov 74688e487a mib: add net to TCP_DEC_STATS
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16 20:22:46 -07:00
Pavel Emelyanov 63231bddf6 mib: add net to TCP_INC_STATS_BH
Same as before - the sock is always there to get the net from,
but there are also some places with the net already saved on 
the stack.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16 20:22:25 -07:00
Pavel Emelyanov 81cc8a75d9 mib: add net to TCP_INC_STATS
Fortunately (almost) all the TCP code has a sock to get the net from :)

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16 20:22:04 -07:00
Pavel Emelyanov a9c19329ec tcp: add net to tcp_mib_init
This one sets TCP MIBs after zeroing them, and thus requires
the net.

The existing single caller can use init_net (temporarily).

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16 20:21:42 -07:00
Pavel Emelyanov a86b1e3019 inet: prepare struct net for TCP MIB accounting
This is the same as the first patch in the set, but preparing
the net for TCP_XXX_STATS - save the struct net on the stack
where required and possible.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16 20:20:58 -07:00
Pavel Emelyanov c5346fe396 mib: add net to IP_ADD_STATS_BH
Very simple - only ip_evictor (fragments) requires such.
This patch ends up the IP_XXX_STATS patching.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16 20:20:33 -07:00
Pavel Emelyanov 7c73a6faff mib: add net to IP_INC_STATS_BH
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16 20:20:11 -07:00
Pavel Emelyanov 5e38e27044 mib: add net to IP_INC_STATS
All the callers already have either the net itself, or the place
where to get it from.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16 20:19:49 -07:00
Pavel Emelyanov 84a3aa000e ipv4: prepare net initialization for IP accounting
Some places, that deal with IP statistics already have where to
get a struct net from, but use it directly, without declaring
a separate variable on the stack.

So, save this net on the stack for future IP_XXX_STATS macros.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16 20:19:08 -07:00
Will Newton 70efce27fc net/ipv4/tcp.c: Fix use of PULLHUP instead of POLLHUP in comments.
Change PULLHUP to POLLHUP in tcp_poll comments and clean up another
comment for grammar and coding style.

Signed-off-by: Will Newton <will.newton@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16 20:13:43 -07:00
Harvey Harrison 7b1c65faa2 net: make __skb_splice_bits static
net/core/skbuff.c:1335:5: warning: symbol '__skb_splice_bits' was not declared. Should it be static?

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16 20:12:30 -07:00
David S. Miller 885a4c966b Merge branch 'stealer/ipvs/sync-daemon-cleanup-for-next' of git://git.stealer.net/linux-2.6 2008-07-16 20:07:06 -07:00
Rumen G. Bogdanovski 9d3a0de7dc ipvs: More reliable synchronization on connection close
This patch enhances the synchronization of the closing connections
between the master and the backup director. It prevents the closed
connections to expire with the 15 min timeout of the ESTABLISHED
state on the backup and makes them expire as they would do on the
master with much shorter timeouts.

Signed-off-by: Rumen G. Bogdanovski <rumen@voicecho.com>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16 20:04:23 -07:00
Sven Wegener 375c6bbabf ipvs: Use schedule_timeout_interruptible() instead of msleep_interruptible()
So that kthread_stop() can wake up the thread and we don't have to wait one
second in the worst case for the daemon to actually stop.

Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Simon Horman <horms@verge.net.au>
2008-07-16 22:33:20 +00:00
Sven Wegener ba6fd85021 ipvs: Put backup thread on mcast socket wait queue
Instead of doing an endless loop with sleeping for one second, we now put the
backup thread onto the mcast socket wait queue and it gets woken up as soon as
we have data to process.

Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Simon Horman <horms@verge.net.au>
2008-07-16 22:33:20 +00:00
Sven Wegener 998e7a7680 ipvs: Use kthread_run() instead of doing a double-fork via kernel_thread()
This also moves the setup code out of the daemons, so that we're able to
return proper error codes to user space. The current code will return success
to user space when the daemon is started with an invald mcast interface. With
these changes we get an appropriate "No such device" error.

We longer need our own completion to be sure the daemons are actually running,
because they no longer contain code that can fail and kthread_run() takes care
of the rest.

Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Simon Horman <horms@verge.net.au>
2008-07-16 22:33:20 +00:00
Sven Wegener e6dd731c75 ipvs: Use ERR_PTR for returning errors from make_receive_sock() and make_send_sock()
The additional information we now return to the caller is currently not used,
but will be used to return errors to user space.

Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Simon Horman <horms@verge.net.au>
2008-07-16 22:33:19 +00:00
Sven Wegener d56400504a ipvs: Initialize mcast addr at compile time
There's no need to do it at runtime, the values are constant.

Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Simon Horman <horms@verge.net.au>
2008-07-16 22:33:19 +00:00
Trond Myklebust cadc723cc1 Merge branch 'bkl-removal' into next 2008-07-15 18:34:58 -04:00
Trond Myklebust e89e896d31 Merge branch 'devel' into next
Conflicts:

	fs/nfs/file.c

Fix up the conflict with Jon Corbet's bkl-removal tree
2008-07-15 18:34:16 -04:00
Ingo Molnar 82638844d9 Merge branch 'linus' into cpus4096
Conflicts:

	arch/x86/xen/smp.c
	kernel/sched_rt.c
	net/iucv/iucv.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-16 00:29:07 +02:00
Trond Myklebust a86dc496b7 SUNRPC: Remove the BKL from the callback functions
Push it into those callback functions that actually need it.

Note that all the NFS operations use their own locking, so don't need the
BKL. Ditto for the rpcbind client.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15 18:10:57 -04:00
Chuck Lever c2e1b09ff2 SUNRPC: Support registering IPv6 interfaces with local rpcbind daemon
Introduce a new API to register RPC services on IPv6 interfaces to allow
the NFS server and lockd to advertise on IPv6 networks.

Unlike rpcb_register(), the new rpcb_v4_register() function uses rpcbind
protocol version 4 to contact the local rpcbind daemon.  The version 4
SET/UNSET procedures allow services to register address families besides
AF_INET, register at specific network interfaces, and register transport
protocols besides UDP and TCP.  All of this functionality is exposed via
the new rpcb_v4_register() kernel API.

A user-space rpcbind daemon implementation that supports version 4 of the
rpcbind protocol is required in order to make use of this new API.

Note that rpcbind version 3 is sufficient to support the new rpcbind
facilities listed above, but most extant implementations use version 4.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15 18:08:55 -04:00
Chuck Lever babe80eb49 SUNRPC: Refactor rpcb_register to make rpcbindv4 support easier
rpcbind version 4 registration will reuse part of rpcb_register, so just
split it out into a separate function now.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15 18:08:51 -04:00
Chuck Lever 423d8b0647 SUNRPC: None of rpcb_create's callers wants a privileged source port
Clean up: Callers that required a privileged source port now use
rpcb_create_local(), so we can remove the @privileged argument from
rpcb_create().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15 18:08:47 -04:00
Chuck Lever cc5598b78f SUNRPC: Introduce a specific rpcb_create for contacting localhost
Add rpcb_create_local() for use by rpcb_register() and upcoming IPv6
registration functions.

Ensure any errors encountered by rpcb_create_local() are properly
reported.

We can also use a statically allocated constant loopback socket address
instead of one allocated on the stack and initialized every time the
function is called.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15 18:08:44 -04:00
Chuck Lever 166b88d755 SUNRPC: Use correct XDR encoding procedure for rpcbind SET/UNSET
The rpcbind versions 3 and 4 SET and UNSET procedures use the same
arguments as the GETADDR procedure.

While definitely a bug, this hasn't been a problem so far since the
kernel hasn't used version 3 or 4 SET and UNSET.  But this will change
in just a moment.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15 18:08:40 -04:00
Linus Torvalds 59190f4213 Merge branch 'generic-ipi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'generic-ipi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (22 commits)
  generic-ipi: more merge fallout
  generic-ipi: merge fix
  x86, visws: use mach-default/entry_arch.h
  x86, visws: fix generic-ipi build
  generic-ipi: fixlet
  generic-ipi: fix s390 build bug
  generic-ipi: fix linux-next tree build failure
  fix: "smp_call_function: get rid of the unused nonatomic/retry argument"
  fix: "smp_call_function: get rid of the unused nonatomic/retry argument"
  fix "smp_call_function: get rid of the unused nonatomic/retry argument"
  on_each_cpu(): kill unused 'retry' parameter
  smp_call_function: get rid of the unused nonatomic/retry argument
  sh: convert to generic helpers for IPI function calls
  parisc: convert to generic helpers for IPI function calls
  mips: convert to generic helpers for IPI function calls
  m32r: convert to generic helpers for IPI function calls
  arm: convert to generic helpers for IPI function calls
  alpha: convert to generic helpers for IPI function calls
  ia64: convert to generic helpers for IPI function calls
  powerpc: convert to generic helpers for IPI function calls
  ...

Fix trivial conflicts due to rcu updates in kernel/rcupdate.c manually
2008-07-15 14:12:03 -07:00
Ingo Molnar 1a781a777b Merge branch 'generic-ipi' into generic-ipi-for-linus
Conflicts:

	arch/powerpc/Kconfig
	arch/s390/kernel/time.c
	arch/x86/kernel/apic_32.c
	arch/x86/kernel/cpu/perfctr-watchdog.c
	arch/x86/kernel/i8259_64.c
	arch/x86/kernel/ldt.c
	arch/x86/kernel/nmi_64.c
	arch/x86/kernel/smpboot.c
	arch/x86/xen/smp.c
	include/asm-x86/hw_irq_32.h
	include/asm-x86/hw_irq_64.h
	include/asm-x86/mach-default/irq_vectors.h
	include/asm-x86/mach-voyager/irq_vectors.h
	include/asm-x86/smp.h
	kernel/Makefile

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-15 21:55:59 +02:00
Ingo Molnar 6c9fcaf2ee Merge branch 'core/rcu' into core/rcu-for-linus 2008-07-15 21:10:12 +02:00
Akinobu Mita d0236f8f82 iucv: fix memory leak in cpu hotplug error path.
Fix memory leak in error path in CPU_UP_PREPARE notifier.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-15 02:09:53 -07:00
Octavian Purdila 2870c43d17 net: refactor tcp splice receive path to improve readability
- move all of the details on offsets, lengths and buffers into a
single function instead of doing these operation from multiple places

- use a bottom up approach: try to avoid details in the high level
functions, introduce them gradually as we go deeper in the function
call stack

With helpful feedback from Jarek Poplawski.

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Acked-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-15 00:49:11 -07:00
David S. Miller b9e4085768 netdev: Do not use TX lock to protect address lists.
Now that we have a specific lock to protect the network
device unicast and multicast lists, remove extraneous
grabs of the TX lock in cases where the code only needs
address list protection.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-15 00:15:08 -07:00
David S. Miller e308a5d806 netdev: Add netdev->addr_list_lock protection.
Add netif_addr_{lock,unlock}{,_bh}() helpers.

Use them to protect operations that operate on or read
the network device unicast and multicast address lists.

Also use them in cases where the code simply wants to
block calls into the driver's ->set_rx_mode() and
->set_multicast_list() methods.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-15 00:13:44 -07:00
David S. Miller f1f28aa351 netdev: Add addr_list_lock to struct net_device.
This will be used to protect the per-device unicast and multicast
address lists, as well as the callbacks into the drivers which
configure such state such as ->set_rx_mode() and ->set_multicast_list().

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-15 00:08:33 -07:00
Pavel Emelyanov f66ac03d49 mib: add struct net to ICMPMSGIN_INC_STATS_BH
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 23:05:31 -07:00
Pavel Emelyanov 903fc1964e mib: add struct net to ICMPMSGOUT_INC_STATS
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 23:05:30 -07:00
Pavel Emelyanov dcfc23cac1 mib: add struct net to ICMP_INC_STATS_BH
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 23:05:29 -07:00
Pavel Emelyanov 75c939bb4d mib: add struct net to ICMP_INC_STATS
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 23:05:28 -07:00
Pavel Emelyanov fd54d716b1 inet: toss struct net initialization around
Some places, that deal with ICMP statistics already have where
to get a struct net from, but use it directly, without declaring
a separate variable on the stack.

Since I will need this net soon, I declare a struct net on the
stack and use it in the existing places in a separate patch not
to spoil the future ones.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 23:05:26 -07:00
Pavel Emelyanov 0388b00426 icmp: add struct net argument to icmp_out_count
This routine deals with ICMP statistics, but doesn't have a
struct net at hands, so add one.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 23:05:13 -07:00
Patrick McHardy 61362766d7 vlan: remove unnecessary include statements
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 22:51:55 -07:00
Patrick McHardy d80aa31bbf vlan: clean up hard_start_xmit functions
Remove excessive comments and debugging, use NETDEV_TX codes,
remove some empty lines.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 22:51:39 -07:00
Patrick McHardy 1349fe9a6b vlan: clean up vlan_dev_hard_header()
Remove some debugging and excessive comments, merge the two
dev_hard_header calls into one.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 22:51:19 -07:00
Patrick McHardy 19b9a4e256 vlan: ethtool ->get_flags support
Allow to query LRO settings of underlying device when VLAN RX
acceleration is used.

Suggested by Ben Hutchings <bhutchings@solarflare.com>.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 22:51:01 -07:00
Patrick McHardy 393e52e33c packet: deliver VLAN TCI to userspace
Store the VLAN tag in the auxillary data/tpacket2_hdr so userspace can
properly deal with hardware VLAN tagging/stripping.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 22:50:39 -07:00
Patrick McHardy bbd6ef87c5 packet: support extensible, 64 bit clean mmaped ring structure
The tpacket_hdr is not 64 bit clean due to use of an unsigned long
and can't be extended because the following struct sockaddr_ll needs
to be at a fixed offset.

Add support for a version 2 tpacket protocol that removes these
limitations.

Userspace can query the header size through a new getsockopt option
and change the protocol version through a setsockopt option. The
changes needed to switch to the new protocol version are:

1. replace struct tpacket_hdr by struct tpacket2_hdr
2. query header len and save
3. set protocol version to 2
 - set up ring as usual
4. for getting the sockaddr_ll, use (void *)hdr + TPACKET_ALIGN(hdrlen)
   instead of (void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))

Steps 2 and 4 can be omitted if the struct sockaddr_ll isn't needed.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 22:50:15 -07:00
Patrick McHardy bc1d0411b8 vlan: deliver packets received with VLAN acceleration to network taps
When VLAN header stripping is used, packets currently bypass packet
sockets (and other network taps) completely. For locally existing
VLANs, they appear directly on the VLAN device, for unknown VLANs
they are silently dropped.

Add a new function netif_nit_deliver() to deliver incoming packets
to all network interface taps and use it in __vlan_hwaccel_rx() to
make VLAN packets visible on the underlying device.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 22:49:30 -07:00
Patrick McHardy 6aa895b047 vlan: Don't store VLAN tag in cb
Use a real skb member to store the skb to avoid clashes with qdiscs,
which are allowed to use the cb area themselves. As currently only real
devices that consume the skb set the NETIF_F_HW_VLAN_TX flag, no explicit
invalidation is neccessary.

The new member fills a hole on 64 bit, the skb layout changes from:

        __u32                      mark;                 /*   172     4 */
        sk_buff_data_t             transport_header;     /*   176     4 */
        sk_buff_data_t             network_header;       /*   180     4 */
        sk_buff_data_t             mac_header;           /*   184     4 */
        sk_buff_data_t             tail;                 /*   188     4 */
        /* --- cacheline 3 boundary (192 bytes) --- */
        sk_buff_data_t             end;                  /*   192     4 */

        /* XXX 4 bytes hole, try to pack */

to

        __u32                      mark;                 /*   172     4 */
        __u16                      vlan_tci;             /*   176     2 */

        /* XXX 2 bytes hole, try to pack */

        sk_buff_data_t             transport_header;     /*   180     4 */
        sk_buff_data_t             network_header;       /*   184     4 */

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 22:49:06 -07:00
Allan Stephens 968edbe1c8 tipc: Optimization to multicast name lookup algorithm
This patch simplifies and speeds up TIPC's algorithm for identifying
on-node and off-node destinations that overlap a multicast name
sequence range.  Rather than traversing the list of all known name
publications within the cluster, it now traverses the (potentially
much shorter) list of name publications made by the node itself, and
determines if any off-node destinations exist by comparing the sizes
of the two lists.  (Since the node list must be a subset of the
cluster list, a difference in sizes means that at least one off-node
destination must exist.)

Signed-off-by: Allan Stephens <allan.stephens@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 22:45:33 -07:00
Allan Stephens 1aad72d6cd tipc: Add missing locks when inspecting node list & link list
This patch ensures that TIPC configuration commands that display info
about neighboring nodes and their links take the spinlocks that
protect the node list and link lists from changing while the lists
are being traversed.

Signed-off-by: Allan Stephens <allan.stephens@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 22:44:58 -07:00
Allan Stephens 08d2cf0f74 tipc: Fix bug in scope checking for multicast messages
This patch ensures that TIPC's multicast message name lookup
algorithm does individualized scope checking for each published
name it examines.  Previously, scope checking was only done for
the first name in a name table publication list, which could
result in incoming multicast messages being delivered to ports
publishing names with "node" scope, or not being delivered to
ports publishing names with "cluster" or "zone" scope.

Signed-off-by: Allan Stephens <allan.stephens@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 22:44:32 -07:00
Allan Stephens 0e35fd5e52 tipc: Eliminate improper use of TIPC_OK error code
This patch corrects many places where TIPC routines indicated
successful completion by returning TIPC_OK instead of 0.
(The TIPC_OK symbol has the value 0, but it should only be used
in contexts that deal with the error code field of a TIPC
message header.)

Signed-off-by: Allan Stephens <allan.stephens@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 22:44:01 -07:00
Allan Stephens 2da59918e2 tipc: Fix race condition that could cause accept() to fail
This patch ensurs that accept() returns successfully even when
the newly created socket is immediately disconnected by its peer.
Previously, accept() would fail if it was unable to pass back
the optional address info for the socket's peer before the
socket became disconnected; TIPC now allows accept() to gather
peer address information after disconnection.  As a bonus, the
revised code accesses the socket's port more efficiently, without
the overhead incurred by a reference table lookup.

Signed-off-by: Allan Stephens <allan.stephens@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 22:43:32 -07:00
Allan Stephens 8642bd9e04 tipc: Optimize pointer dereferencing when receiving stream data
This patch eliminates an unnecessary pointer dereference when
accessing a stream-based socket's receive queue.

Signed-off-by: Allan Stephens <allan.stephens@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 22:42:51 -07:00
Allan Stephens 0ea522416b tipc: Remove unneeded parameter to tipc_createport_raw()
This patch eliminates an unneeded parameter when creating a low-level
TIPC port object.  Instead of returning both the pointer to the port
structure and the port's reference ID, it now returns only the pointer
since the port structure contains the reference ID as one of its fields.

Signed-off-by: Allan Stephens <allan.stephens@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 22:42:19 -07:00
Denis V. Lunev 83aa2e964b netlabel: return msg overflow error from netlbl_cipsov4_list faster
Currently, we are trying to place the information from the kernel to
1, 2, 3 and 4 pages sequentially. These pages are allocated via slab.
Though, from the slab point of view steps 3 and 4 are equivalent on
most architectures. So, lets skip 3 pages attempt.

By the way, should we switch from .doit to .dumpit interface here?
The amount of data seems quite big for me.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Paul Moore <paul.moore@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 22:28:25 -07:00
Johann Felix Soden 7197914c35 net: Remove references to wan-router.txt in Kconfigs
This patch removes references in drivers/net/wan/Kconfig and
net/wanrouter/Kconfig to Documentation/networking/wan-router.txt
which was removed in commit 99971e70fd
("[WANPIPE]: Forgotten bits of Sangoma drivers removal.").

Signed-off-by: Johann Felix Soden <johfel@users.sourceforge.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 22:22:29 -07:00
Wang Chen 89146504cb 8021q: Check return of dev_set_promiscuity/allmulti
dev_set_promiscuity/allmulti might overflow.
Commit: "netdevice: Fix promiscuity and allmulti overflow" in net-next makes
dev_set_promiscuity/allmulti return error number if overflow happened.

Here, we check all positive increment for promiscuity and allmulti
to get error return.

Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 20:59:03 -07:00
Wang Chen 7dc00c82cb ipv4: Fix ipmr unregister device oops
An oops happens during device unregister.

The following oops happened when I add two tunnels, which
use a same device, and then delete one tunnel.
Obviously deleting tunnel "A" causes device unregister, which
send a notification, and after receiving notification, ipmr do
unregister again for tunnel "B" which also use same device.
That is wrong.
After receiving notification, ipmr only needs to decrease reference
count and don't do duplicated unregister.
Fortunately, IPv6 side doesn't add tunnel in ip6mr, so it's clean.

This patch fixs:
- unregister device oops
- using after dev_put()

Here is the oops:
===
Jul 11 15:39:29 wangchen kernel: ------------[ cut here ]------------
Jul 11 15:39:29 wangchen kernel: kernel BUG at net/core/dev.c:3651!
Jul 11 15:39:29 wangchen kernel: invalid opcode: 0000 [#1] 
Jul 11 15:39:29 wangchen kernel: Modules linked in: ipip tunnel4 nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs ipv6 snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device af_packet binfmt_misc button battery ac loop dm_mod usbhid ff_memless pcmcia firmware_class ohci1394 8139too mii ieee1394 yenta_socket rsrc_nonstatic pcmcia_core ide_cd_mod cdrom snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm i2c_i801 snd_timer snd i2c_core soundcore snd_page_alloc rng_core shpchp ehci_hcd uhci_hcd pci_hotplug intel_agp agpgart usbcore ext3 jbd ata_piix ahci libata dock edd fan thermal processor thermal_sys piix sd_mod scsi_mod ide_disk ide_core [last unloaded: freq_table]
Jul 11 15:39:29 wangchen kernel: 
Jul 11 15:39:29 wangchen kernel: Pid: 4102, comm: mroute Not tainted (2.6.26-rc9-default #69)
Jul 11 15:39:29 wangchen kernel: EIP: 0060:[<c024636b>] EFLAGS: 00010202 CPU: 0
Jul 11 15:39:29 wangchen kernel: EIP is at rollback_registered+0x61/0xe3
Jul 11 15:39:29 wangchen kernel: EAX: 00000001 EBX: ecba6000 ECX: 00000000 EDX: ffffffff
Jul 11 15:39:29 wangchen kernel: ESI: 00000001 EDI: ecba6000 EBP: c03de2e8 ESP: ed8e7c3c
Jul 11 15:39:29 wangchen kernel:  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
Jul 11 15:39:29 wangchen kernel: Process mroute (pid: 4102, ti=ed8e6000 task=ed41e830 task.ti=ed8e6000)
Jul 11 15:39:29 wangchen kernel: Stack: ecba6000 c024641c 00000028 c0284e1a 00000001 c03de2e8 ecba6000 eecff360 
Jul 11 15:39:29 wangchen kernel:        c0284e4c c03536f4 fffffff8 00000000 c029a819 ecba6000 00000006 ecba6000 
Jul 11 15:39:29 wangchen kernel:        00000000 ecba6000 c03de2c0 c012841b ffffffff 00000000 c024639f ecba6000 
Jul 11 15:39:29 wangchen kernel: Call Trace:
Jul 11 15:39:29 wangchen kernel:  [<c024641c>] unregister_netdevice+0x2f/0x51
Jul 11 15:39:29 wangchen kernel:  [<c0284e1a>] vif_delete+0xaf/0xc3
Jul 11 15:39:29 wangchen kernel:  [<c0284e4c>] ipmr_device_event+0x1e/0x30
Jul 11 15:39:29 wangchen kernel:  [<c029a819>] notifier_call_chain+0x2a/0x47
Jul 11 15:39:29 wangchen kernel:  [<c012841b>] raw_notifier_call_chain+0x9/0xc
Jul 11 15:39:29 wangchen kernel:  [<c024639f>] rollback_registered+0x95/0xe3
Jul 11 15:39:29 wangchen kernel:  [<c024641c>] unregister_netdevice+0x2f/0x51
Jul 11 15:39:29 wangchen kernel:  [<c0284e1a>] vif_delete+0xaf/0xc3
Jul 11 15:39:29 wangchen kernel:  [<c0285eee>] ip_mroute_setsockopt+0x47a/0x801
Jul 11 15:39:29 wangchen kernel:  [<eea5a70c>] do_get_write_access+0x2df/0x313 [jbd]
Jul 11 15:39:29 wangchen kernel:  [<c01727c4>] __find_get_block_slow+0xda/0xe4
Jul 11 15:39:29 wangchen kernel:  [<c0172a7f>] __find_get_block+0xf8/0x122
Jul 11 15:39:29 wangchen kernel:  [<c0172a7f>] __find_get_block+0xf8/0x122
Jul 11 15:39:29 wangchen kernel:  [<eea5d563>] journal_cancel_revoke+0xda/0x110 [jbd]
Jul 11 15:39:29 wangchen kernel:  [<c0263501>] ip_setsockopt+0xa9/0x9ee
Jul 11 15:39:29 wangchen kernel:  [<eea5d563>] journal_cancel_revoke+0xda/0x110 [jbd]
Jul 11 15:39:29 wangchen kernel:  [<eea5a70c>] do_get_write_access+0x2df/0x313 [jbd]
Jul 11 15:39:29 wangchen kernel:  [<eea69287>] __ext3_get_inode_loc+0xcf/0x271 [ext3]
Jul 11 15:39:29 wangchen kernel:  [<eea743c7>] __ext3_journal_dirty_metadata+0x13/0x32 [ext3]
Jul 11 15:39:29 wangchen kernel:  [<c0116434>] __wake_up+0xf/0x15
Jul 11 15:39:29 wangchen kernel:  [<eea5a424>] journal_stop+0x1bd/0x1c6 [jbd]
Jul 11 15:39:29 wangchen kernel:  [<eea703a7>] __ext3_journal_stop+0x19/0x34 [ext3]
Jul 11 15:39:29 wangchen kernel:  [<c014291e>] get_page_from_freelist+0x94/0x369
Jul 11 15:39:29 wangchen kernel:  [<c01408f2>] filemap_fault+0x1ac/0x2fe
Jul 11 15:39:29 wangchen kernel:  [<c01a605e>] security_sk_alloc+0xd/0xf
Jul 11 15:39:29 wangchen kernel:  [<c023edea>] sk_prot_alloc+0x36/0x78
Jul 11 15:39:29 wangchen kernel:  [<c0240037>] sk_alloc+0x3a/0x40
Jul 11 15:39:29 wangchen kernel:  [<c0276062>] raw_hash_sk+0x46/0x4e
Jul 11 15:39:29 wangchen kernel:  [<c0166aff>] d_alloc+0x1b/0x157
Jul 11 15:39:29 wangchen kernel:  [<c023e4d1>] sock_common_setsockopt+0x12/0x16
Jul 11 15:39:29 wangchen kernel:  [<c023cb1e>] sys_setsockopt+0x6f/0x8e
Jul 11 15:39:29 wangchen kernel:  [<c023e105>] sys_socketcall+0x15c/0x19e
Jul 11 15:39:29 wangchen kernel:  [<c0103611>] sysenter_past_esp+0x6a/0x99
Jul 11 15:39:29 wangchen kernel:  [<c0290000>] unix_poll+0x69/0x78
Jul 11 15:39:29 wangchen kernel:  =======================
Jul 11 15:39:29 wangchen kernel: Code: 83 e0 01 00 00 85 c0 75 1f 53 53 68 12 81 31 c0 e8 3c 30 ed ff ba 3f 0e 00 00 b8 b9 7f 31 c0 83 c4 0c 5b e9 f5 26 ed ff 48 74 04 <0f> 0b eb fe 89 d8 e8 21 ff ff ff 89 d8 e8 62 ea ff ff c7 83 e0 
Jul 11 15:39:29 wangchen kernel: EIP: [<c024636b>] rollback_registered+0x61/0xe3 SS:ESP 0068:ed8e7c3c
Jul 11 15:39:29 wangchen kernel: ---[ end trace c311acf85d169786 ]---
===

Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 20:56:34 -07:00
Wang Chen d607032db0 ipv4: Check return of dev_set_allmulti
allmulti might overflow.
Commit: "netdevice: Fix promiscuity and allmulti overflow" in net-next makes
dev_set_promiscuity/allmulti return error number if overflow happened.

Here, we check the positive increment for allmulti to get error return.

PS: For unwinding tunnel creating, we let ipip->ioctl() to handle it.

Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 20:55:26 -07:00
Wang Chen 7af3db78a9 ipv6: Fix using after dev_put()
Patrick McHardy pointed it out.

Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 20:54:54 -07:00
Wang Chen 5ae7b44413 ipv6: Check return of dev_set_allmulti
allmulti might overflow.
Commit: "netdevice: Fix promiscuity and allmulti overflow" in net-next makes
dev_set_promiscuity/allmulti return error number if overflow happened.

Here, we check the positive increment for allmulti to get error return.

Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Acked-by: Patrick McHardy <kaber@trash.net> 
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 20:54:23 -07:00
Wang Chen bc3f9076f6 bridge: Check return of dev_set_promiscuity
dev_set_promiscuity/allmulti might overflow.
Commit: "netdevice: Fix promiscuity and allmulti overflow" in net-next makes
dev_set_promiscuity/allmulti return error number if overflow happened.

Here, we check the positive increment for promiscuity to get error return.

Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 20:53:13 -07:00
Wang Chen 2aeb0b88b3 af_packet: Check return of dev_set_promiscuity/allmulti
dev_set_promiscuity/allmulti might overflow.  Commit: "netdevice: Fix
promiscuity and allmulti overflow" in net-next makes
dev_set_promiscuity/allmulti return error number if overflow happened.

In af_packet, we check all positive increment for promiscuity and
allmulti to get error return.

Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Acked-by: Patrick McHardy <kaber@trash.net> 
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 20:49:46 -07:00
David S. Miller fc943b12e4 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next-2.6 2008-07-14 20:40:34 -07:00
Patrick McHardy 72d9794f44 net-sched: cls_flow: add perturbation support
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 20:36:32 -07:00
David S. Miller 0c4c8cae44 Merge branch 'master' of git://eden-feed.erg.abdn.ac.uk/net-next-2.6 2008-07-14 20:32:07 -07:00
David S. Miller 2aec609fb4 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:

	net/netfilter/nf_conntrack_proto_tcp.c
2008-07-14 20:23:54 -07:00
David S. Miller 4c88949800 netfilter: Let nf_ct_kill() callers know if del_timer() returned true.
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 20:22:38 -07:00
Linus Torvalds 666484f025 Merge branch 'core/softirq' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core/softirq' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  softirq: remove irqs_disabled warning from local_bh_enable
  softirq: remove initialization of static per-cpu variable
  Remove argument from open_softirq which is always NULL
2008-07-14 15:28:42 -07:00
Linus Torvalds d1794f2c5b Merge branch 'bkl-removal' of git://git.lwn.net/linux-2.6
* 'bkl-removal' of git://git.lwn.net/linux-2.6: (146 commits)
  IB/umad: BKL is not needed for ib_umad_open()
  IB/uverbs: BKL is not needed for ib_uverbs_open()
  bf561-coreb: BKL unneeded for open()
  Call fasync() functions without the BKL
  snd/PCM: fasync BKL pushdown
  ipmi: fasync BKL pushdown
  ecryptfs: fasync BKL pushdown
  Bluetooth VHCI: fasync BKL pushdown
  tty_io: fasync BKL pushdown
  tun: fasync BKL pushdown
  i2o: fasync BKL pushdown
  mpt: fasync BKL pushdown
  Remove BKL from remote_llseek v2
  Make FAT users happier by not deadlocking
  x86-mce: BKL pushdown
  vmwatchdog: BKL pushdown
  vmcp: BKL pushdown
  via-pmu: BKL pushdown
  uml-random: BKL pushdown
  uml-mmapper: BKL pushdown
  ...
2008-07-14 14:48:31 -07:00
Jonathan Corbet 2fceef397f Merge commit 'v2.6.26' into bkl-removal 2008-07-14 15:29:34 -06:00
Emmanuel Grumbach 1e18863790 mac80211: dont add a STA which is not in the same IBSS
This patch avoids adding STAs that don't belong to our IBSS
ieee80211_bssid_match matches also bcast address so also APs
were added

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-07-14 14:52:57 -04:00
Johannes Berg f434b2d111 mac80211: fix struct ieee80211_tx_queue_params
Multiple issues:
 - there are no "default" values needed
 - cw_min/cw_max can be larger than documented
 - restructure to decrease size
 - use get_unaligned_le16

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-07-14 14:52:57 -04:00
Johannes Berg f591fa5dbb mac80211: fix TX sequence numbers
This patch makes mac80211 assign proper sequence numbers to
QoS-data frames. It also removes the old sequence number code
because we noticed that only the driver or hardware can assign
sequence numbers to non-QoS-data and especially management
frames in a race-free manner because beacons aren't passed
through mac80211's TX path.

This patch also adds temporary code to the rt2x00 drivers to
not break them completely, that code will have to be reworked
for proper sequence numbers on beacons.

It also moves sequence number assignment down in the TX path
so no sequence numbers are assigned to frames that are dropped.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-07-14 14:52:57 -04:00
Johannes Berg 22bb1be4d2 wext: make sysfs bits optional and deprecate them
The /sys/class/net/*/wireless/ direcory is, as far as I know, not
used by anyone. Additionally, the same data is available via wext
ioctls. Hence the sysfs files are pretty much useless. This patch
makes them optional and schedules them for removal.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Cc: Jean Tourrilhes <jt@hpl.hp.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-07-14 14:52:57 -04:00
Johannes Berg 1411f9b531 mac80211: fix RX sequence number check
According to 802.11-2007, we are doing the wrong thing in the
sequence number checks when receiving frames. This fixes it.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-07-14 14:52:57 -04:00
Emmanuel Grumbach 2560b6e2e4 mac80211: Fix ieee80211_rx_reorder_ampdu: ignore QoS null packets
This patch fixes the check at the entrance to ieee80211_rx_reorder_ampdu.
This check has been broken by 'mac80211: rx.c use new helpers'.

Letting QoS NULL packet in ieee80211_rx_reorder_ampdu led to packet loss in
RX.

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-07-14 14:52:56 -04:00
Johannes Berg 9d139c810a mac80211: revamp beacon configuration
This patch changes mac80211's beacon configuration handling
to never pass skbs to the driver directly but rather always
require the driver to use ieee80211_beacon_get(). Additionally,
it introduces "change flags" on the config_interface() call
to enable drivers to figure out what is changing. Finally, it
removes the beacon_update() driver callback in favour of
having IBSS beacon delivered by ieee80211_beacon_get() as well.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-07-14 14:30:07 -04:00
Johannes Berg f3947e2dfa mac80211: push interface checks down
This patch pushes the "netif_running()" and "same type as before"
checks down into ieee80211_if_change_type() to centralise the
logic instead of duplicating it for cfg80211 and wext.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-07-14 14:30:07 -04:00
Johannes Berg 75636525fb mac80211: revamp virtual interface handling
This patch revamps the virtual interface handling and makes the
code much easier to follow. Fewer functions, better names, less
spaghetti code.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-07-14 14:30:07 -04:00
Johannes Berg 3e122be089 mac80211: make master netdev handling sane
Currently, almost every interface type has a 'bss' pointer
pointing to BSS information. This BSS information, however,
is for a _local_ BSS, not for the BSS we joined, so having
it on a STA mode interface makes little sense, but now they
have it pointing to the master device, which is an AP mode
virtual interface. However, except for some bitrate control
data, this pointer is only used in AP/VLAN modes (for power
saving stations.)

Overall, it is not necessary to even have the master netdev
be a valid virtual interface, and it doesn't have to be on
the list of interfaces either.

This patch changes the master netdev to be special, it now
 - no longer is on the list of virtual interfaces, which
   lets me remove a lot of tests for that
 - no longer has sub_if_data attached, since that isn't used

Additionally, this patch changes some vlan/ap mode handling
that is related to these 'bss' pointers described above (but
in the VLAN case they actually make sense because there they
point to the AP they belong to); it also adds some debugging
code to IEEE80211_DEV_TO_SUB_IF to validate it is not called
on the master netdev any more.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-07-14 14:30:06 -04:00
Samuel Ortiz 49292d5635 mac80211: power management wext hooks
This patch implements the power management routines wireless extensions
for mac80211.
For now we only support switching PS mode between on and off.

Signed-off-by: Samuel Ortiz <sameo@openedhand.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-07-14 14:30:06 -04:00
Marcel Holtmann b1235d7961 [Bluetooth] Allow security for outgoing L2CAP connections
When requested the L2CAP layer will now enforce authentication and
encryption on outgoing connections. The usefulness of this feature
is kinda limited since it will not allow proper connection ownership
tracking until the authentication procedure has been finished. This
is a limitation of Bluetooth 2.0 and before and can only be fixed by
using Simple Pairing.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2008-07-14 20:13:54 +02:00
Marcel Holtmann 7cb127d5b0 [Bluetooth] Add option to disable eSCO connection creation
It has been reported that some eSCO capable headsets are not able to
connect properly. The real reason for this is unclear at the moment. So
for easier testing add a module parameter to disable eSCO connection
creation.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2008-07-14 20:13:53 +02:00
Marcel Holtmann ec8dab36e0 [Bluetooth] Signal user-space for HIDP and BNEP socket errors
When using the HIDP or BNEP kernel support, the user-space needs to
know if the connection has been terminated for some reasons. Wake up
the application if that happens. Otherwise kernel and user-space are
no longer on the same page and weird behaviors can happen.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2008-07-14 20:13:53 +02:00
Marcel Holtmann a0c22f2265 [Bluetooth] Move pending packets from RFCOMM socket to TTY
When an incoming RFCOMM socket connection gets converted into a TTY,
it can happen that packets are lost. This mainly happens with the
Handsfree profile where the remote side starts sending data right
away. The problem is that these packets are in the socket receive
queue. So when creating the TTY make sure to copy all pending packets
from the socket receive queue to a private queue inside the TTY.

To make this actually work, the flow control on the newly created TTY
will be disabled and only enabled again when the TTY is opened by an
application. And right before that, the pending packets will be put
into the TTY flip buffer.

Signed-off-by: Denis Kenzior <denis.kenzior@trolltech.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2008-07-14 20:13:52 +02:00
Marcel Holtmann 8b6b3da765 [Bluetooth] Store remote modem status for RFCOMM TTY
When switching a RFCOMM socket to a TTY, the remote modem status might
be needed later. Currently it is lost since the original configuration
is done via the socket interface. So store the modem status and reply
it when the socket has been converted to a TTY.

Signed-off-by: Denis Kenzior <denis.kenzior@trolltech.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2008-07-14 20:13:52 +02:00
Marcel Holtmann ca37bdd53b [Bluetooth] Use non-canonical TTY by default for RFCOMM
While the RFCOMM TTY emulation can act like a real serial port, in
reality it is not used like this. So to not mess up stupid applications,
use the non-canonical mode by default.

Signed-off-by: Denis Kenzior <denis.kenzior@trolltech.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2008-07-14 20:13:52 +02:00
Marcel Holtmann 78c6a1744f [Bluetooth] Update Bluetooth core version number
With all the Bluetooth 2.1 changes and the support for Simple Pairing,
it is important to update the Bluetooth core version number.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2008-07-14 20:13:51 +02:00
Marcel Holtmann 7d0db0a373 [Bluetooth] Use a more unique bus name for connections
When attaching Bluetooth low-level connections to the bus, the bus name
is constructed from the remote address since at that time the connection
handle is not assigned yet. This has worked so far, but also caused a
lot of troubles. It is better to postpone the creation of the sysfs
entry to the time when the connection actually has been established
and then use its connection handle as unique identifier.

This also fixes the case where two different adapters try to connect
to the same remote device.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2008-07-14 20:13:51 +02:00
Marcel Holtmann 43cbeee9f9 [Bluetooth] Add support for TIOCOUTQ and TIOCINQ ioctls
Almost every protocol family supports the TIOCOUTQ and TIOCINQ ioctls
and even Bluetooth could make use of them. When implementing audio
streaming and integration with GStreamer or PulseAudio they will allow
a better timing and synchronization.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2008-07-14 20:13:51 +02:00
Marcel Holtmann 3241ad820d [Bluetooth] Add timestamp support to L2CAP, RFCOMM and SCO
Enable the common timestamp functionality that the network subsystem
provides for L2CAP, RFCOMM and SCO sockets. It is possible to either
use SO_TIMESTAMP or the IOCTLs to retrieve the timestamp of the
current packet.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2008-07-14 20:13:50 +02:00
Marcel Holtmann 40be492fe4 [Bluetooth] Export details about authentication requirements
With the Simple Pairing support, the authentication requirements are
an explicit setting during the bonding process. Track and enforce the
requirements and allow higher layers like L2CAP and RFCOMM to increase
them if needed.

This patch introduces a new IOCTL that allows to query the current
authentication requirements. It is also possible to detect Simple
Pairing support in the kernel this way.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2008-07-14 20:13:50 +02:00
Marcel Holtmann f8558555f3 [Bluetooth] Initiate authentication during connection establishment
With Bluetooth 2.1 and Simple Pairing the requirement is that any new
connection needs to be authenticated and that encryption has been
switched on before allowing L2CAP to use it. So make sure that all
the requirements are fulfilled and otherwise drop the connection with
a minimal disconnect timeout of 10 milliseconds.

This change only affects Bluetooth 2.1 devices and Simple Pairing
needs to be enabled locally and in the remote host stack. The previous
changes made sure that these information are discovered before any
kind of authentication and encryption is triggered.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2008-07-14 20:13:49 +02:00
Marcel Holtmann 769be974d0 [Bluetooth] Use ACL config stage to retrieve remote features
The Bluetooth technology introduces new features on a regular basis
and for some of them it is important that the hardware on both sides
support them. For features like Simple Pairing it is important that
the host stacks on both sides have switched this feature on. To make
valid decisions, a config stage during ACL link establishment has been
introduced that retrieves remote features and if needed also the remote
extended features (known as remote host features) before signalling
this link as connected.

This change introduces full reference counting of incoming and outgoing
ACL links and the Bluetooth core will disconnect both if no owner of it
is present. To better handle interoperability during the pairing phase
the disconnect timeout for incoming connections has been increased to
10 seconds. This is five times more than for outgoing connections.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2008-07-14 20:13:49 +02:00
Marcel Holtmann a8bd28baf2 [Bluetooth] Export remote Simple Pairing mode via sysfs
Since the remote Simple Pairing mode is stored together with the
inquiry cache, it makes sense to show it together with the other
information.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2008-07-14 20:13:49 +02:00
Marcel Holtmann 41a96212b3 [Bluetooth] Track status of remote Simple Pairing mode
The Simple Pairing process can only be used if both sides have the
support enabled in the host stack. The current Bluetooth specification
has three ways to detect this support.

If an Extended Inquiry Result has been sent during inquiry then it
is safe to assume that Simple Pairing is enabled. It is not allowed
to enable Extended Inquiry without Simple Pairing. During the remote
name request phase a notification with the remote host supported
features will be sent to indicate Simple Pairing support. Also the
second page of the remote extended features can indicate support for
Simple Pairing.

For all three cases the value of remote Simple Pairing mode is stored
in the inquiry cache for later use.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2008-07-14 20:13:48 +02:00
Marcel Holtmann 333140b57f [Bluetooth] Track status of Simple Pairing mode
The Simple Pairing feature is optional and needs to be enabled by the
host stack first. The Linux kernel relies on the Bluetooth daemon to
either enable or disable it, but at any time it needs to know the
current state of the Simple Pairing mode. So track any changes made
by external entities and store the current mode in the HCI device
structure.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2008-07-14 20:13:48 +02:00
Marcel Holtmann 0493684ed2 [Bluetooth] Disable disconnect timer during Simple Pairing
During the Simple Pairing process the HCI disconnect timer must be
disabled. The way to do this is by holding a reference count of the
HCI connection. The Simple Pairing process on both sides starts with
an IO Capabilities Request and ends with Simple Pairing Complete.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2008-07-14 20:13:48 +02:00
Marcel Holtmann c7bdd5026d [Bluetooth] Update class of device value whenever possible
The class of device value can only be retrieved via inquiry or during
an incoming connection request. Outgoing connections can't ask for the
class of device. To compensate for this the value is stored and copied
via the inquiry cache, but currently only updated via inquiry. This
update should also happen during an incoming connection request.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2008-07-14 20:13:47 +02:00
Marcel Holtmann f383f2750a [Bluetooth] Some cleanups for HCI event handling
Some minor cosmetic cleanups to the HCI event handling to make the
code easier to read and understand.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2008-07-14 20:13:47 +02:00
Marcel Holtmann e4e8e37c42 [Bluetooth] Make use of the default link policy settings
The Bluetooth specification supports the default link policy settings
on a per host controller basis. For every new connection the link
manager would then use these settings. It is better to use this instead
of bothering the controller on every connection setup to overwrite the
default settings.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2008-07-14 20:13:47 +02:00
Marcel Holtmann a8746417e8 [Bluetooth] Track connection packet type changes
The connection packet type can be changed after the connection has been
established and thus needs to be properly tracked to ensure that the
host stack has always correct and valid information about it.

On incoming connections the Bluetooth core switches the supported packet
types to the configured list for this controller. However the usefulness
of this feature has been questioned a lot. The general consent is that
every Bluetooth host stack should enable as many packet types as the
hardware actually supports and leave the decision to the link manager
software running on the Bluetooth chip.

When running on Bluetooth 2.0 or later hardware, don't change the packet
type for incoming connections anymore. This hardware likely supports
Enhanced Data Rate and thus leave it completely up to the link manager
to pick the best packet type.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2008-07-14 20:13:46 +02:00
Marcel Holtmann 9dc0a3afc0 [Bluetooth] Support the case when headset falls back to SCO link
When trying to establish an eSCO link between two devices then it can
happen that the remote device falls back to a SCO link. Currently this
case is not handled correctly and the message dispatching will break
since it is looking for eSCO packets. So in case the configured link
falls back to SCO overwrite the link type with the correct value.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2008-07-14 20:13:46 +02:00
Marcel Holtmann ae29319649 [Bluetooth] Update authentication status after successful encryption
The authentication status is not communicated to both parties. This is
actually a flaw in the Bluetooth specification. Only the requesting side
really knows if the authentication was successful or not. This piece of
information is however needed on the other side to know if it has to
trigger the authentication procedure or not. Worst case is that both
sides will request authentication at different times, but this should
be avoided since it costs extra time when setting up a new connection.

For Bluetooth encryption it is required to authenticate the link first
and the encryption status is communicated to both sides. So when a link
is switched to encryption it is possible to update the authentication
status since it implies an authenticated link.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2008-07-14 20:13:45 +02:00
Marcel Holtmann 9719f8afce [Bluetooth] Disconnect when encryption gets disabled
The Bluetooth specification allows to enable or disable the encryption
of an ACL link at any time by either the peer or the remote device. If
a L2CAP or RFCOMM connection requested an encrypted link, they will now
disconnect that link if the encryption gets disabled. Higher protocols
that don't care about encryption (like SDP) are not affected.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2008-07-14 20:13:45 +02:00
Marcel Holtmann 77db198056 [Bluetooth] Enforce security for outgoing RFCOMM connections
Recent tests with various Bluetooth headsets have shown that some of
them don't enforce authentication and encryption when connecting. All
of them leave it up to the host stack to enforce it. Non of them should
allow unencrypted connections, but that is how it is. So in case the
link mode settings require authentication and/or encryption it will now
also be enforced on outgoing RFCOMM connections. Previously this was
only done for incoming connections.

This support has a small drawback from a protocol level point of view
since the host stack can't really tell with 100% certainty if a remote
side is already authenticated or not. So if both sides are configured
to enforce authentication it will be requested twice. Most Bluetooth
chips are caching this information and thus no extra authentication
procedure has to be triggered over-the-air, but it can happen.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2008-07-14 20:13:45 +02:00
Marcel Holtmann 79d554a697 [Bluetooth] Change retrieval of L2CAP features mask
Getting the remote L2CAP features mask is really important, but doing
this as less intrusive as possible is tricky. To play nice with older
systems and Bluetooth qualification testing, the features mask is now
only retrieved in two specific cases and only once per lifetime of an
ACL link.

When trying to establish a L2CAP connection and the remote features mask
is unknown, the L2CAP information request is sent when the ACL link goes
into connected state. This applies only to outgoing connections and also
only for the connection oriented channels.

The second case is when a connection request has been received. In this
case a connection response with the result pending and the information
request will be send. After receiving an information response or if the
timeout gets triggered, the normal connection setup process with security
setup will be initiated.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2008-07-14 20:13:44 +02:00
Ursula Braun c2b4afd2f9 [S390] Cleanup iucv printk messages.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Ursula Braun <braunu@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2008-07-14 10:02:20 +02:00
Gerrit Renker 2eeea7ba6b dccp ccid-3: Length of loss intervals
This corrects an error in the computation of the open loss interval I_0:
  * the interval length is (highest_seqno - start_seqno) + 1
  * and not (highest_seqno - start_seqno).

This condition was not fully clear in RFC 3448, but reflects the current
revision state of rfc3448bis and is also consistent with RFC 4340, 6.1.1.

Further changes:
----------------
 * variable renamed due to line length constraints;
 * explicit typecast to `s64' to avoid implicit signed/unsigned casting.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
2008-07-13 11:51:40 +01:00
Gerrit Renker b552c6231f dccp ccid-3: Fix a loss detection bug
This fixes a bug in the logic of the TFRC loss detection:
 * new_loss_indicated() should not be called while a loss is pending;
 * but the code allows this;
 * thus, for two subsequent gaps in the sequence space, when loss_count
   has not yet reached NDUPACK=3, the loss_count is falsely reduced to 1.

To avoid further and similar problems, all loss handling and loss detection is
now done inside tfrc_rx_hist_handle_loss(), using an appropriate routine to
track new losses.

Further changes:
----------------
 * added a reminder that no RX history operations should be performed when
   rx_handle_loss() has identified a (new) loss, since the function takes
   care of packet reordering during loss detection;
 * made tfrc_rx_hist_loss_pending() bool (thanks to an earlier suggestion
   by Arnaldo);		 
 * removed unused functions.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
2008-07-13 11:51:40 +01:00
Gerrit Renker 5b5d0e7048 dccp: Upgrade NDP count from 3 to 6 bytes
RFC 4340, 7.7 specifies up to 6 bytes for the NDP Count option, whereas the code
is currently limited to up to 3 bytes. This seems to be a relict of an earlier 
draft version and is brought up to date by the patch.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
2008-07-13 11:51:40 +01:00
Gerrit Renker 2013c7e35a dccp ccid-3: Fix error in loss detection
The TFRC loss detection code used the wrong loss condition (RFC 4340, 7.7.1):
 * the difference between sequence numbers s1 and s2 instead of 
 * the number of packets missing between s1 and s2 (one less than the distance).

Since this condition appears in many places of the code, it has been put into a
separate function, dccp_loss_free().

Further changes:
----------------
 * tidied up incorrect typing (it was using `int' for u64/s64 types);
 * optimised conditional statements for common case of non-reordered packets;
 * rewrote comments/documentation to match the changes.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
2008-07-13 11:51:40 +01:00
Ingo Molnar 0c81b2a144 Merge branch 'linus' into core/rcu
Conflicts:

	include/linux/rculist.h
	kernel/rcupreempt.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-11 10:46:50 +02:00
Linus Torvalds e5a5816f78 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (27 commits)
  tun: Persistent devices can get stuck in xoff state
  xfrm: Add a XFRM_STATE_AF_UNSPEC flag to xfrm_usersa_info
  ipv6: missed namespace context in ipv6_rthdr_rcv
  netlabel: netlink_unicast calls kfree_skb on error path by itself
  ipv4: fib_trie: Fix lookup error return
  tcp: correct kcalloc usage
  ip: sysctl documentation cleanup
  Documentation: clarify tcp_{r,w}mem sysctl docs
  netfilter: nf_nat_snmp_basic: fix a range check in NAT for SNMP
  netfilter: nf_conntrack_tcp: fix endless loop
  libertas: fix memory alignment problems on the blackfin
  zd1211rw: stop beacons on remove_interface
  rt2x00: Disable synchronization during initialization
  rc80211_pid: Fix fast_start parameter handling
  sctp: Add documentation for sctp sysctl variable
  ipv6: fix race between ipv6_del_addr and DAD timer
  irda: Fix netlink error path return value
  irda: New device ID for nsc-ircc
  irda: via-ircc proper dma freeing
  sctp: Mark the tsn as received after all allocations finish
  ...
2008-07-10 17:58:47 -07:00
Steffen Klassert ccf9b3b83d xfrm: Add a XFRM_STATE_AF_UNSPEC flag to xfrm_usersa_info
Add a XFRM_STATE_AF_UNSPEC flag to handle the AF_UNSPEC behavior for
the selector family. Userspace applications can set this flag to leave
the selector family of the xfrm_state unspecified.  This can be used
to to handle inter family tunnels if the selector is not set from
userspace.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-10 16:55:37 -07:00
Denis V. Lunev 0ce28553cc ipv6: missed namespace context in ipv6_rthdr_rcv
Signed-off-by: Denis V. Lunev <den@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-10 16:54:50 -07:00
Denis V. Lunev fe785bee05 netlabel: netlink_unicast calls kfree_skb on error path by itself
So, no need to kfree_skb here on the error path. In this case we can
simply return.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Paul Moore <paul.moore@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-10 16:53:39 -07:00
Ben Hutchings 2e655571c6 ipv4: fib_trie: Fix lookup error return
In commit a07f5f508a "[IPV4] fib_trie: style
cleanup", the changes to check_leaf() and fn_trie_lookup() were wrong - where
fn_trie_lookup() would previously return a negative error value from
check_leaf(), it now returns 0.
 
Now fn_trie_lookup() doesn't appear to care about plen, so we can revert
check_leaf() to returning the error value.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Tested-by: William Boughton <bill@boughton.de>
Acked-by: Stephen Heminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-10 16:52:52 -07:00
Milton Miller 3d8ea1fd70 tcp: correct kcalloc usage
kcalloc is supposed to be called with the count as its first argument and
the element size as the second.

Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-10 16:51:32 -07:00
David S. Miller 2ddddb9869 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6 2008-07-09 15:10:09 -07:00
David Howells 252815b0cf netfilter: nf_nat_snmp_basic: fix a range check in NAT for SNMP
Fix a range check in netfilter IP NAT for SNMP to always use a big enough size
variable that the compiler won't moan about comparing it to ULONG_MAX/8 on a
64-bit platform.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-09 15:06:45 -07:00
Patrick McHardy 6b69fe0c73 netfilter: nf_conntrack_tcp: fix endless loop
When a conntrack entry is destroyed in process context and destruction
is interrupted by packet processing and the packet is an attempt to
reopen a closed connection, TCP conntrack tries to kill the old entry
itself and returns NF_REPEAT to pass the packet through the hook
again. This may lead to an endless loop: TCP conntrack repeatedly
finds the old entry, but can not kill it itself since destruction
is already in progress, but destruction in process context can not
complete since TCP conntrack is keeping the CPU busy.

Drop the packet in TCP conntrack if we can't kill the connection
ourselves to avoid this.

Reported by: hemao77@gmail.com [ Kernel bugzilla #11058 ]
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-09 15:06:12 -07:00
Mattias Nissler adeed48090 rc80211_pid: Fix fast_start parameter handling
This removes the fast_start parameter from the rc_pid parameters
information and instead uses the parameter macro when initializing
the rc_pid state. Since the parameter is only used on initialization,
there is no point of making exporting it via debugfs. This also fixes
uninitialized memory references to the fast_start and norm_offset
parameters detected by the kmemcheck utility.  Thanks to Vegard Nossum
for reporting the bug.

Signed-off-by: Mattias Nissler <mattias.nissler@gmx.de>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-07-09 16:16:31 -04:00
Trond Myklebust 381ba74af5 SUNRPC: Ensure our task is notified when an rpcbind call is done
If another task is busy in rpcb_getport_async number, it is more efficient
to have it wake us up when it has finished instead of arbitrarily sleeping
for 5 seconds.

Also ensure that rpcb_wake_rpcbind_waiters() is called regardless of
whether or not rpcb_getport_done() gets called.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:45 -04:00
Chuck Lever 40fef8a649 SUNRPC: Use only rpcbind v2 for AF_INET requests
Some server vendors support the higher versions of rpcbind only for
AF_INET6.  The kernel doesn't need to use v3 or v4 for AF_INET anyway,
so change the kernel's rpcbind client to query AF_INET servers over
rpcbind v2 only.

This has a few interesting benefits:

1. If the rpcbind request is going over TCP, and the server doesn't
   support rpcbind versions 3 or 4, the client reduces by two the number
   of ephemeral ports left in TIME_WAIT for each rpcbind request.  This
   will help during NFS mount storms.

2. The rpcbind interaction with servers that don't support rpcbind
   versions 3 or 4 will use less network traffic.  Also helpful
   during mount storms.

3. We can eliminate the kernel build option that controls whether the
   kernel's rpcbind client uses rpcbind version 3 and 4 for AF_INET
   servers.  Less complicated kernel configuration...

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:37 -04:00
Chuck Lever 8842413aa4 SUNRPC: Use GETADDR for rpcbind version 4 queries
Some rpcbind servers that do support rpcbind version 4 do not support
the GETVERSADDR procedure.  Use GETADDR for querying rpcbind servers
via rpcbind version 4 instead of GETVERSADDR.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:36 -04:00
Chuck Lever 6a77405157 SUNRPC: Use rpcbind version 2 GETPORT
Clean up: Change the version 2 procedure name to GETPORT.  It's the same
procedure number as GETADDR, but version 2 implementations usually refer
to it as GETPORT.

This also now matches the procedure name used in the version 2 procedure
entry in the rpcb_next_version[] array, making it slightly less confusing.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:35 -04:00
Chuck Lever fc200e794d SUNRPC: Document some naked integers in rpcbind client
Clean up: Replace naked integers that represent rpcbind protocol versions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:34 -04:00
Chuck Lever 877fcf1039 SUNRPC: More useful debugging output for rpcb client
Clean up dprintk's in rpcb client's XDR decoder functions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:33 -04:00
Chuck Lever b22602a673 SUNRPC: Ensure all transports set rq_xtime consistently
The RPC client uses the rq_xtime field in each RPC request to determine the
round-trip time of the request.  Currently, the rq_xtime field is
initialized by each transport just before it starts enqueing a request to
be sent.  However, transports do not handle initializing this value
consistently; sometimes they don't initialize it at all.

To make the measurement of request round-trip time consistent for all
RPC client transport capabilities, pull rq_xtime initialization into the
RPC client's generic transport logic.  Now all transports will get a
standardized RTT measure automatically, from:

  xprt_transmit()

to

  xprt_complete_rqst()

This makes round-trip time calculation more accurate for the TCP transport.
The socket ->sendmsg() method can return "-EAGAIN" if the socket's output
buffer is full, so the TCP transport's ->send_request() method may call
the ->sendmsg() method repeatedly until it gets all of the request's bytes
queued in the socket's buffer.

Currently, the TCP transport sets the rq_xtime field every time through
that loop so the final value is the timestamp just before the *last* call
to the underlying socket's ->sendmsg() method.  After this patch, the
rq_xtime field contains a timestamp that reflects the time just before the
*first* call to ->sendmsg().

This is consequential under heavy workloads because large requests often
take multiple ->sendmsg() calls to get all the bytes of a request queued.
The TCP transport causes the request to sleep until the remote end of the
socket has received enough bytes to clear space in the socket's local
output buffer.  This delay can be quite significant.

The method introduced by this patch is a more accurate measure of RTT
for stream transports, since the server can cause enough back pressure
to delay (ie increase the latency of) requests from the client.

Additionally, this patch corrects the behavior of the RDMA transport, which
entirely neglected to initialize the rq_xtime field.  RPC performance
metrics for RDMA transports now display correct RPC request round trip
times.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Tom Talpey <thomas.talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:15 -04:00
\\\"J. Bruce Fields\\\ a486aeda9b rpc: minor cleanup of scheduler callback code
Try to make the comment here a little more clear and concise.

Also, this macro definition seems unnecessary.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:14 -04:00
\\\"J. Bruce Fields\\\ d25a03cf96 rpc: remove some unused macros
There used to be a print_hexl() function that used isprint(), now gone.
I don't know why NFS_NGROUPS and CA_RUN_AS_MACHINE were here.

I also don't know why another #define that's actually used was marked
"unused".

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:12 -04:00
\\\"J. Bruce Fields\\\ 720b8f2d6f rpc: eliminate unused variable in auth_gss upcall code
Also, a minor comment grammar fix in the same file.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:11 -04:00
Olga Kornievskaia b6b6152c46 rpc: bring back cl_chatty
The cl_chatty flag alows us to control whether a given rpc client leaves

	"server X not responding, timed out"

messages in the syslog.  Such messages make sense for ordinary nfs
clients (where an unresponsive server means applications on the
mountpoint are probably hanging), but not for the callback client (which
can fail more commonly, with the only result just of disabling some
optimizations).

Previously cl_chatty was removed, do to lack of users; reinstate it, and
use it for the nfsd's callback client.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:10 -04:00
Chuck Lever cd983ef81b SUNRPC: Remove obsolete messages during transport connect
Recent changes to the RPC client's transport connect logic make connect
status values ECONNREFUSED and ECONNRESET impossible.

Clean up xprt_connect_status() to account for these changes.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:06 -04:00
Chuck Lever cb3997b5a0 SUNRPC: Display some debugging information as text rather than numbers
In rpc_show_tasks(), display the program name, version number, procedure
name and tk_action as human-readable variable-length text fields rather
than columnar numbers.

Doing the symbol lookup here helps in cases where we have actual
debugging output from a kernel log, but don't have access to the kernel
image or RPC module that generated the output.

Sample output:

 -pid- flgs status -client- --rqstp- -timeout ---ops--
  5608 0001    -11 eeb42690 f6d93710        0 f8fa1764 nfsv3 WRITE a:call_transmit_status q:none
  5609 0001    -11 eeb42690 f6d937e0        0 f8fa1764 nfsv3 WRITE a:call_status q:xprt_sending
  5610 0001    -11 eeb42690 f6d93230        0 f8fa1764 nfsv3 WRITE a:call_status q:xprt_sending
  5611 0001    -11 eeb42690 f6d93300        0 f8fa1764 nfsv3 WRITE a:call_status q:xprt_sending
  5612 0001    -11 eeb42690 f6d93090        0 f8fa1764 nfsv3 WRITE a:call_status q:xprt_sending
  5613 0001    -11 eeb42690 f6d933d0        0 f8fa1764 nfsv3 WRITE a:call_status q:xprt_sending
  5614 0001    -11 eeb42690 f6d93cc0        0 f8fa1764 nfsv3 WRITE a:call_status q:xprt_sending
  5615 0001    -11 eeb42690 f6d93a50        0 f8fa1764 nfsv3 WRITE a:call_status q:xprt_sending
  5616 0001    -11 eeb42690 f6d93640        0 f8fa1764 nfsv3 WRITE a:call_status q:xprt_sending
  5617 0001    -11 eeb42690 f6d93b20        0 f8fa1764 nfsv3 WRITE a:call_status q:xprt_sending
  5618 0001    -11 eeb42690 f6d93160        0 f8fa1764 nfsv3 WRITE a:call_status q:xprt_sending

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:08:58 -04:00
Chuck Lever 38e886e0c1 SUNRPC: Refactor rpc_show_tasks
Clean up: move the logic that displays each task to its own function.
This removes indentation and makes future changes easier.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:08:56 -04:00
Chuck Lever 68a23ee94e SUNRPC: Don't display the rpc_show_tasks header if there are no tasks
Clean up: don't display the rpc_show_tasks column header unless there is at
least one task to display.  As far as I can tell, it is safe to let the
list_for_each_entry macro decide that each list is empty.

scripts/checkpatch.pl also wants a KERN_FOO at the start of any newly added
printk() calls, so this and subsequent patches will also add KERN_INFO.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:08:55 -04:00
Chuck Lever b0e1c57ea0 SUNRPC: Rename "call_" functions that are no longer FSM states
The RPC client uses a finite state machine to move RPC tasks through each
step of an RPC request.  Each state is contained in a function in
net/sunrpc/clnt.c, and named call_foo.

Some of the functions named call_foo have changed over the past few years and
are no longer states in the FSM.  These include: call_encode, call_header,
and call_verify.  As a clean up, rename the functions that have changed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:08:53 -04:00
Chuck Lever 3748f1e447 SUNRPC: Add a function to display the name of an RPC procedure
Improve debugging messages in call_start() and call_verify() by having
them show the RPC procedure name instead of the procedure number.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:08:53 -04:00
Trond Myklebust 0f38b873ae SUNRPC: Use GFP_NOFS when allocating credentials
Since the credentials may be allocated during the call to rpc_new_task(),
which again may be called by a memory allocator...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:08:48 -04:00
Trond Myklebust b390c2b55c SUNRPC: An ENOMEM error from call_encode is always fatal
The special 'ENOMEM' case that was previously flagged as non-fatal is
bogus: auth_gss always returns EAGAIN for non-fatal errors, and may in fact
return ENOMEM in the special case where xdr_buf_read_netobj runs out of
preallocated buffer space (invariably a _fatal_ error, since there is no
provision for preallocating larger buffers).

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:08:43 -04:00
Trond Myklebust 8b39f2b410 SUNRPC: Ensure we exit early in case of an encode error
All errors from call_encode(), with exception of EAGAIN are fatal, so we
should immediately return instead of proceeding to xprt_transmit().

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:08:41 -04:00
David S. Miller 79d16385c7 netdev: Move atomic queue state bits into netdev_queue.
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-08 23:14:46 -07:00
David S. Miller b19fa1fa91 net: Delete NETDEVICES_MULTIQUEUE kconfig option.
Multiple TX queue support is a core networking feature.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-08 23:14:24 -07:00
David S. Miller c773e847ea netdev: Move _xmit_lock and xmit_lock_owner into netdev_queue.
Accesses are mostly structured such that when there are multiple TX
queues the code transformations will be a little bit simpler.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-08 23:13:53 -07:00
David S. Miller eb6aafe3f8 pkt_sched: Make qdisc_run take a netdev_queue.
This allows us to use this calling convention all the way down into
qdisc_restart().

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-08 23:12:38 -07:00
David S. Miller 86d804e10a netdev: Make netif_schedule() routines work with netdev_queue objects.
Only plain netif_schedule() remains taking a net_device, mostly as a
compatability item while we transition the rest of these interfaces.

Everything else calls netif_schedule_queue() or __netif_schedule(),
both of which take a netdev_queue pointer.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-08 23:11:25 -07:00
David S. Miller 970565bbad netdev: Move gso_skb into netdev_queue.
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-08 23:10:33 -07:00
David S. Miller c2aa288548 mac80211: Decrease number of explicit ->tx_queue references.
Accomplish this by using local variables.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-08 23:01:52 -07:00
David S. Miller 052979499c pkt_sched: Add qdisc_tx_is_noop() helper and use in IPV6.
This indicates if the NOOP scheduler is what is active for TX on a
given device.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-08 23:01:27 -07:00
David S. Miller 6fa9864b53 net: Clean up explicit ->tx_queue references in link watch.
First, we add a qdisc_tx_changing() helper which returns true if the
qdisc attachment is in transition.

Second, we remove an assertion warning which is of limited value and
is hard to express precisely in a multiqueue environment.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-08 23:01:06 -07:00
David S. Miller ee609cb362 netdev: Move next_sched into struct netdev_queue.
We schedule queues, not the device, for output queue processing in BH.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-08 22:58:37 -07:00
David S. Miller 74d58a0c1d pkt_sched: Make netem queue agnostic.
It just wants the root qdisc given an arbitrary qdisc,
and that is simply qdisc->dev_queue->qdisc

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
2008-07-08 22:57:51 -07:00
David S. Miller 68dfb42798 pkt_sched: Kill stats_lock member of struct Qdisc.
It is always equal to qdisc->dev_queue->lock

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-08 22:57:31 -07:00
David S. Miller 816f3258e7 netdev: Kill qdisc_ingress, use netdev->rx_queue.qdisc instead.
Now that our qdisc management is bi-directional, per-queue, and fully
orthogonal, there is no reason to have a special ingress qdisc pointer
in struct net_device.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-08 22:49:00 -07:00