Use a different CQ for send completions, where send completions are
polled by the interrupt-driven receive completion handler. Therefore,
interrupts aren't used for the send CQ.
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Now that both the posting and reaping of receive buffers are done in
the completion path, the counter of outstanding buffers need not be atomic.
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Currently, the recv buffer posting logic is based on the transactional
nature of iSER which allows for posting a buffer before sending a PDU.
Change this to post only when the number of outstanding recv buffers
is below a water mark and in a batched manner, thus simplifying and
optimizing the data path. Use a pre-allocated ring of recv buffers
instead of allocating from kmem cache. A special treatment is given
to the login response buffer whose size must be 8K unlike the size of
buffers used for any other purpose which is 128 bytes.
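The watermark-and-batch posting described above can be pictured with a minimal userspace sketch. All names and sizes here (rx_ring, RX_RING_SIZE, RX_MIN_POSTED, RX_BATCH) are hypothetical and not taken from the iSER code; actual posting to the QP is reduced to bookkeeping.

```c
#include <assert.h>

#define RX_RING_SIZE  64u  /* hypothetical ring size, a power of two */
#define RX_MIN_POSTED 16u  /* hypothetical low-water mark */
#define RX_BATCH      32u  /* hypothetical number of buffers posted at once */

struct rx_ring {
    unsigned int head;        /* next pre-allocated ring slot to post */
    unsigned int outstanding; /* buffers currently posted to the hardware */
    unsigned int posts;       /* number of batched post operations so far */
};

/*
 * Called from the completion path once per reaped receive buffer.
 * Only the completion path touches the counter, so it is a plain integer.
 * Returns how many buffers were re-posted by this call.
 */
static unsigned int rx_complete_one(struct rx_ring *r)
{
    unsigned int posted = 0;

    assert(r->outstanding > 0);
    r->outstanding--;

    if (r->outstanding < RX_MIN_POSTED) {
        unsigned int room = RX_RING_SIZE - r->outstanding;

        /* Post one batch, limited by the free slots in the ring. */
        posted = room < RX_BATCH ? room : RX_BATCH;
        r->head = (r->head + posted) & (RX_RING_SIZE - 1);
        r->outstanding += posted;
        r->posts++;
    }
    return posted;
}
```

Most completions only decrement the counter; a post happens once per batch rather than once per PDU.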
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
An upcoming major change to the recv buffer posting logic removes the
problem that commit bba7ebb ("avoid recv buffer exhaustion caused by
unexpected PDUs") was meant to solve, so revert that commit.
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Change the nes driver to return -ENOMEM on SQ/RQ overflow to match the
return code of other RDMA HW drivers (e.g. cxgb3, ehca, mlx4, mthca).
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Acked-by: Chien Tung <chien.tin.tung@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
There is a double disconnect during AE processing, causing crashes.
While fixing the crash, also simplify the AE handling code.
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When a listener is destroyed and there is an MPA response pending for
loopback connection, the active side cm_node gets destroyed twice:
once in cm_event_connect_error() and again in nes_accept()/nes_reject().
Increment the cm_node's refcount so it's not destroyed by
cm_event_connect_error().
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
After running long iterative MPI tests, ethtool sometimes reports a
"CM Destroy Listener" count greater than the "CM Create Listener" count.
Fix this inconsistency by making the counter variables atomic.
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If the caller does not pass a valid in_wc to process_mad(), return MAD
failure status, as it is not possible to generate a valid MAD redirect
response (and redirects are the only MAD responses ehca generates).
Signed-off-by: Alexander Schmidt <alexs@linux.vnet.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Dunno, what was the idea, it wasn't used for a long time.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The max_dest_rd_atomic and max_qp_rd_atomic values are properly
returned by query_qp(), so there should not be an error returned when
they are queried.
Signed-off-by: Alexander Schmidt <alexs@linux.vnet.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The irq_spinlock is only taken in tasklet context, so it is safe not to
disable hardware interrupts.
Signed-off-by: Alexander Schmidt <alexs@linux.vnet.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
struct ib_qp already holds a pointer to the ib device. No need to dive to the
hw device object to retrieve it.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch replaces dev->mc_count in all drivers (hopefully I didn't miss
anything). Used spatch and did small tweaks and coding style changes where
suitable.
Jirka
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make sure the compiler won't do weird things with limits by using the
rlimit helpers added in 3e10e716 ("resource: add helpers for fetching
rlimits"). E.g. fetching them twice may return 2 different values
after writable limits are implemented.
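To illustrate the double-fetch hazard, here is a minimal userspace sketch. It does not use the kernel's rlimit helpers; soft_limit, LIMIT_INFINITY and within_limit are hypothetical names, and a volatile read stands in for the single access the helpers guarantee.

```c
#include <assert.h>
#include <limits.h>

#define LIMIT_INFINITY ULONG_MAX  /* hypothetical stand-in for RLIM_INFINITY */

struct soft_limit {
    volatile unsigned long cur;   /* may be rewritten concurrently */
};

/* Read the limit exactly once; volatile keeps the compiler from re-reading. */
static unsigned long limit_snapshot(const struct soft_limit *l)
{
    assert(l);
    return l->cur;
}

static int within_limit(const struct soft_limit *l, unsigned long want)
{
    unsigned long lim = limit_snapshot(l);  /* single fetch */

    /* Both tests below see the same value, even if l->cur changes. */
    if (lim == LIMIT_INFINITY)
        return 1;
    return want <= lim;
}
```

Had the code read l->cur separately for each test, a concurrent update between the two reads could let the checks disagree.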
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
As of commit f56bcd8 ("IPoIB: Use separate CQ for UD send
completions"), there are no TX interrupts. Change the ethtool code
not to report TX moderation settings, so users will not be misled to
think they can control TX interrupt moderation. Pointed out by Alex
Vainman <alexv@voltaire.com>
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Failure to rearm a CQ means the cxgb3 device is wedged, but we shouldn't
kill the whole system with a BUG_ON() if this happens.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
RDMA/cm: Revert association of an RDMA device when binding to loopback
Revert the following change from commit 6f8372b6 ("RDMA/cm: fix
loopback address support"):
    The defined behavior of rdma_bind_addr is to associate an RDMA
    device with an rdma_cm_id, as long as the user specified a non-zero
    address (i.e. they weren't just trying to reserve a port).
    Currently, if the loopback address is passed to rdma_bind_addr,
    no device is associated with the rdma_cm_id. Fix this.
It turns out that important apps such as Open MPI depend on
rdma_bind_addr() NOT associating any RDMA device when binding to a
loopback address. Open MPI is being updated to deal with this, but at
least until a new Open MPI release is available, maintain the previous
behavior: allow rdma_bind_addr() to succeed, but do not bind to a
device.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
In particular, several occurrences of funny versions of 'success',
'unknown', 'therefore', 'acknowledge', 'argument', 'achieve', 'address',
'beginning', 'desirable', 'separate' and 'necessary' are fixed.
Signed-off-by: Daniel Mack <daniel@caiaq.de>
Cc: Joe Perches <joe@perches.com>
Cc: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Use the %pM kernel extension to display the MAC address.
The only difference in the output is that the MAC address is
shown in the usual colon-separated hex notation.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Correct misspelled "CONFIG_IPv6" that was introduced in commit
d14714df ("IB/addr: Fix IPv6 routing lookup"). The config variable
should be all uppercase.
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
[ This was my fault when I munged the original patch. - Roland ]
Signed-off-by: Roland Dreier <rolandd@cisco.com>
In mlx4_ib_post_recv(), we should check the queue for overflow using
recv_cq instead of send_cq (current code looks like a copy-and-paste
mistake).
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
As with memfree mthca hardware, ConnectX also requires SRQ WQE scatter
entries to be initialized with the invalid L_Key at SRQ creation time.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix the "ignoring return value of '...', declared with attribute
warn_unused_result" compiler warning in several users of the new kfifo
API.
Remove the __must_check attribute from kfifo_in() and
kfifo_in_locked(), whose return values do not necessarily need to be checked.
Fix the allocation bug in the nozomi driver file by moving the
kfifo_alloc() call out of the interrupt handler and into the probe function.
Fix the kfifo_out() and kfifo_out_locked() users to handle an unexpected
end of fifo.
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename kfifo_put... to kfifo_in... to prevent misuse by old
out-of-tree drivers.
Ditto for kfifo_get... -> kfifo_out...
Improve the prototypes of kfifo_in and kfifo_out to make the kerneldoc
annotations more readable.
Add a mini "howto porting to the new API" in kfifo.h.
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the pointer to the spinlock out of struct kfifo. Most users in
tree do not actually use a spinlock, so the few exceptions now have to
call kfifo_{get,put}_locked, which take an extra argument: a pointer to a
spinlock.
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a new generic kernel FIFO implementation.
The current kernel fifo API is not very widely used, because it has too
many constraints. Only 17 files in the current 2.6.31-rc5 used it.
FIFOs are, like lists, a very basic thing, and a kfifo API which handles
most use cases would save a lot of development time and memory
resources.
I think these are the reasons why kfifo is not in use:
- The API is too simple; important functions are missing
- A fifo can only be allocated dynamically
- A spinlock is required whether you need it or not
- There is no support for data records inside a fifo
So I decided to extend the kfifo in a more generic way without blowing up
the API too much. The new API has the following benefits:
- Generic usage: for kernel internal use and/or device drivers.
- Provides an API for most use cases.
- Slim API: the whole API provides 25 functions.
- Linux style habit.
- DECLARE_KFIFO, DEFINE_KFIFO and INIT_KFIFO macros.
- Direct copy_to_user from the fifo and copy_from_user into the fifo.
- The kfifo itself is an in-place member of the using data structure; this saves an
  indirection access and does not waste the kernel allocator.
- Lockless access: if only one reader and one writer are active on the fifo,
  which is the common use case, no additional locking is necessary.
- No built-in spinlock: the user is free to choose what kind of locking to use if
  one is required.
- Ability to handle records. Three types of records are supported:
  - Variable length records between 0-255 bytes, with a record size
    field of 1 byte.
  - Variable length records between 0-65535 bytes, with a record size
    field of 2 bytes.
  - Fixed size records, which have no record size field.
- Preserves memory resources.
- Performance!
- Easy to use!
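The single-reader/single-writer property in the list above comes from each side owning one index. The following is a minimal userspace sketch of that idea, not the kfifo implementation: byte_fifo, fifo_in, fifo_out and FIFO_SIZE are hypothetical names, and the memory barriers a real concurrent reader/writer pair would need are left out.

```c
#include <assert.h>
#include <string.h>

#define FIFO_SIZE 8u  /* hypothetical capacity; must be a power of two */

static_assert((FIFO_SIZE & (FIFO_SIZE - 1)) == 0, "size must be a power of two");

struct byte_fifo {
    unsigned char buf[FIFO_SIZE];
    unsigned int in;   /* advanced only by the writer */
    unsigned int out;  /* advanced only by the reader */
};

/* Bytes currently stored; unsigned wrap-around keeps this correct. */
static unsigned int fifo_len(const struct byte_fifo *f)
{
    return f->in - f->out;
}

/* Copy up to len bytes in; returns the number actually stored. */
static unsigned int fifo_in(struct byte_fifo *f, const void *src, unsigned int len)
{
    unsigned int avail = FIFO_SIZE - fifo_len(f);
    unsigned int off = f->in & (FIFO_SIZE - 1);
    unsigned int first;

    if (len > avail)
        len = avail;
    first = len < FIFO_SIZE - off ? len : FIFO_SIZE - off;
    memcpy(f->buf + off, src, first);                               /* up to buffer end */
    memcpy(f->buf, (const unsigned char *)src + first, len - first); /* wrapped part */
    f->in += len;
    return len;
}

/* Copy up to len bytes out; returns the number actually read. */
static unsigned int fifo_out(struct byte_fifo *f, void *dst, unsigned int len)
{
    unsigned int used = fifo_len(f);
    unsigned int off = f->out & (FIFO_SIZE - 1);
    unsigned int first;

    if (len > used)
        len = used;
    first = len < FIFO_SIZE - off ? len : FIFO_SIZE - off;
    memcpy(dst, f->buf + off, first);
    memcpy((unsigned char *)dst + first, f->buf, len - first);
    f->out += len;
    return len;
}
```

Because the structure holds its own buffer, it can be embedded directly in another object, which mirrors the in-place member point in the list.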
This patch:
Since most users want to have the kfifo as part of another object,
reorganize the code to allow including struct kfifo in another data
structure. This requires changing the kfifo_alloc and kfifo_init
prototypes so that we pass an existing kfifo pointer into them. This
patch changes the implementation and all existing users.
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (38 commits)
direct I/O fallback sync simplification
ocfs: stop using do_sync_mapping_range
cleanup blockdev_direct_IO locking
make generic_acl slightly more generic
sanitize xattr handler prototypes
libfs: move EXPORT_SYMBOL for d_alloc_name
vfs: force reval of target when following LAST_BIND symlinks (try #7)
ima: limit imbalance msg
Untangling ima mess, part 3: kill dead code in ima
Untangling ima mess, part 2: deal with counters
Untangling ima mess, part 1: alloc_file()
O_TRUNC open shouldn't fail after file truncation
ima: call ima_inode_free ima_inode_free
IMA: clean up the IMA counts updating code
ima: only insert at inode creation time
ima: valid return code from ima_inode_alloc
fs: move get_empty_filp() deffinition to internal.h
Sanitize exec_permission_lite()
Kill cached_lookup() and real_lookup()
Kill path_lookup_open()
...
Trivial conflicts in fs/direct-io.c
Always set bad_wr when an immediate error is detected. Return ENOMEM
for queue full instead of EINVAL to match other drivers.
Signed-off-by: Frank Zago <fzago@systemfabricworks.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (34 commits)
m68k: rename global variable vmalloc_end to m68k_vmalloc_end
percpu: add missing per_cpu_ptr_to_phys() definition for UP
percpu: Fix kdump failure if booted with percpu_alloc=page
percpu: make misc percpu symbols unique
percpu: make percpu symbols in ia64 unique
percpu: make percpu symbols in powerpc unique
percpu: make percpu symbols in x86 unique
percpu: make percpu symbols in xen unique
percpu: make percpu symbols in cpufreq unique
percpu: make percpu symbols in oprofile unique
percpu: make percpu symbols in tracer unique
percpu: make percpu symbols under kernel/ and mm/ unique
percpu: remove some sparse warnings
percpu: make alloc_percpu() handle array types
vmalloc: fix use of non-existent percpu variable in put_cpu_var()
this_cpu: Use this_cpu_xx in trace_functions_graph.c
this_cpu: Use this_cpu_xx for ftrace
this_cpu: Use this_cpu_xx in nmi handling
this_cpu: Use this_cpu operations in RCU
this_cpu: Use this_cpu ops for VM statistics
...
Fix up trivial (famous last words) global per-cpu naming conflicts in
arch/x86/kvm/svm.c
mm/slab.c
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (222 commits)
[SCSI] zfcp: Remove flag ZFCP_STATUS_FSFREQ_TMFUNCNOTSUPP
[SCSI] zfcp: Activate fc4s attributes for zfcp in FC transport class
[SCSI] zfcp: Block scsi_eh thread for rport state BLOCKED
[SCSI] zfcp: Update FSF error reporting
[SCSI] zfcp: Improve ELS ADISC handling
[SCSI] zfcp: Simplify handling of ct and els requests
[SCSI] zfcp: Remove ZFCP_DID_MASK
[SCSI] zfcp: Move WKA port to zfcp FC code
[SCSI] zfcp: Use common code definitions for FC CT structs
[SCSI] zfcp: Use common code definitions for FC ELS structs
[SCSI] zfcp: Update FCP protocol related code
[SCSI] zfcp: Dont fail SCSI commands when transitioning to blocked fc_rport
[SCSI] zfcp: Assign scheduled work to driver queue
[SCSI] zfcp: Remove STATUS_COMMON_REMOVE flag as it is not required anymore
[SCSI] zfcp: Implement module unloading
[SCSI] zfcp: Merge trace code for fsf requests in one function
[SCSI] zfcp: Access ports and units with container_of in sysfs code
[SCSI] zfcp: Remove suspend callback
[SCSI] zfcp: Remove global config_mutex
[SCSI] zfcp: Replace local reference counting with common kref
...
When the remote node's ethernet address changes, the connection keeps
trying to connect using the old address. The connection will continue
failing until the driver is unloaded and loaded again (either reboot or
rmmod). Fix this by checking that the NIC has the correct address
before starting a connection.
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
A FIN that is received during an MPA start up sequence causes a
timeout in iwcm.c. The connection has not been completely closed so
the iwcm code is waiting for resources to be cleaned up. This closes
the connection so everything cleans up correctly.
Signed-off-by: Don Wood <donald.e.wood@intel.com>
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We fail when creating many qps as kmap() fails for sq_vbase.
Fix this by doing kunmap() as soon as we are done with sq_vbase.
We do kunmap() in one of the locations below:
(1) nes_destroy_qp()
(2) nes_accept()
(3) nes_connect_event
We keep a flag to avoid multiple calls to kunmap().
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
STags are generated randomly but the driver does not correctly prevent
a zero STag. Using STag zero is privileged and causes a user space
application to fail. This change prevents the driver from trying to
allocate a zero STag.
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
While running a Xansation test, an active side node crashed. The
problem started on the passive side, which generated an STag that was
0. The passive side sent a TERMINATE instead of an MPA REJECT msg.
The active side receives the TERMINATE, sends connect_err() and sets
the cm_node state to CLOSED. The passive side sends FIN + ACK after
TERMINATE. The active side ends up in handle_ack_pkt() and send_reset().
send_reset() consumes one reference on the cm_node. Because the cm_node is
in CLOSED state, which means that it will be destroyed after
completion of the connect_err() indication, the CM will crash after
send_reset().
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When the listener is destroyed for a loopback connection, the listener
node gets a reset event. This causes a crash as the listener is not
expecting a reset event. Code review of cm_event_reset() during
debugging showed the cm_id ref count is incremented after calling its
event handler and not before.
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
While running IMP_EXT's window test, we saw a crash in nes_accept().
Here is the sequence of what happened:
(1) In MVAPICH2, connect request is received for port #0.
FIX: Add a nes_connect() check to make sure local or remote tcp port
is not 0.
(2) Remote node's (passive) TCP stack sends a reset when it gets a
connect request because of port = 0. The active side sets the connect
error to IW_CM_EVENT_STATUS_REJECTED when it receives the RST from the
remote node.
FIX: The correct error code is -ECONNRESET.
(3) The wrong error code of IW_CM_EVENT_STATUS_REJECTED causes the core to
destroy its listener ports. There may be connections that
have sent an MPA request up and are waiting for accept or reject, but
the listener and its cm_nodes have already been freed, causing the
crash noticed.
FIX: The cm_node is freed only if its state is not
NES_CM_STATE_MPAREQ_RCVD. If cm_node's state is
NES_CM_STATE_MPAREQ_RCVD then its new state is set to
NES_CM_STATE_LISTENER_DESTROYED and it is not freed. When
nes_accept() or nes_reject() is received, its state is checked
for NES_CM_STATE_LISTENER_DESTROYED and in this case the cm_node
is freed and error is returned.
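The deferred-free rule in the last FIX can be sketched as a tiny state machine. This is an illustration only: the types and function names below are hypothetical, the freed flag stands in for dropping the cm_node, and -EINVAL as the returned error is an assumption.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical, reduced set of connection node states. */
enum node_state { ST_MPAREQ_RCVD, ST_LISTENER_DESTROYED, ST_OTHER };

struct node {
    enum node_state state;
    int freed;  /* stands in for actually freeing the node */
};

/* Listener teardown: keep nodes that still await accept/reject. */
static void on_listener_destroy(struct node *n)
{
    assert(n);
    if (n->state == ST_MPAREQ_RCVD)
        n->state = ST_LISTENER_DESTROYED;  /* defer the free */
    else
        n->freed = 1;
}

/* Accept (or reject) path: finish the deferred free and report an error. */
static int on_accept(struct node *n)
{
    assert(n);
    if (n->state == ST_LISTENER_DESTROYED) {
        n->freed = 1;
        return -EINVAL;  /* assumed error code */
    }
    return 0;
}
```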
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
During testing of REJECT connection error handling, we saw that the
cm_id resources are not released. When the retransmit timer expires,
we need to send a reset message to remote node before issuing the
ABORTED event.
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
During Xansation testing, we saw that MPA frame msg/response errors
are not handled properly.
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The ORD size needs updating as we are supporting more inbound READ
resources per connection.
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Change MAX_CM_BUFFER for MPA frames to be conformant to RFC 5044:
we need 512 + 20 instead of 512.
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The size argument to ioremap_nocache should be the size of desired
information, not the pointer to it.
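The difference is easy to see in plain C. In this small sketch the dev_regs structure and the function names are hypothetical; it only contrasts the two sizeof forms and does not call ioremap_nocache.

```c
#include <assert.h>
#include <stddef.h>

struct dev_regs {
    unsigned char window[256];  /* hypothetical register block */
};

/* Buggy form: yields the size of the pointer itself. */
static size_t wrong_map_size(struct dev_regs *regs)
{
    return sizeof(regs);
}

/* Intended form: yields the size of the object pointed to. */
static size_t right_map_size(struct dev_regs *regs)
{
    return sizeof(*regs);
}

/* sizeof does not evaluate its operand, so a NULL pointer is fine here. */
static void check_sizes(void)
{
    assert(wrong_map_size(NULL) == sizeof(struct dev_regs *));
    assert(right_map_size(NULL) == sizeof(struct dev_regs));
}
```

With the buggy form only a pointer-sized region would be mapped, regardless of how large the register block is.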
The semantic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@expression@
expression *x;
@@
x =
<+...
*sizeof(x)
...+>// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Acked-by: Chien Tung <chien.tin.tung@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Update copyright from Intel-NE, Inc. to Intel Corporation. Use proper
branding string in Kconfig and simplify description.
Signed-off-by: Chien Tung <chien.tin.tung@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a check to nes_create_cq() to return -EINVAL if creating a CQ with
depth > max_cqe (32766).
Signed-off-by: Chien Tung <chien.tin.tung@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add IB_SIGNAL_ALL_WR support as an iWARP extension. If set, make sure
all WRs for the QP are signalled. Consolidate flags used in nesqp
structure.
Signed-off-by: Chien Tung <chien.tin.tung@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add additional PHY uC status check in case PHY firmware is not running
properly with heartbeat. Add a hard PHY reset if uC status is 0x0
after initial reset.
Signed-off-by: Chien Tung <chien.tin.tung@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Always set bad_wr when an immediate error is detected. Do not report
success if an error occurred.
Signed-off-by: Frank Zago <fzago@systemfabricworks.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Always set bad_wr when an immediate error is detected.
Signed-off-by: Frank Zago <fzago@systemfabricworks.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support for IB_WR_SEND_WITH_INV, IB_WR_RDMA_READ_WITH_INV
and IB_WR_LOCAL_INV.
Signed-off-by: Chien Tung <chien.tin.tung@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
On error, set bad_wr in nes_post_recv(). Stop processing ib_wr queue
when an error is detected.
Signed-off-by: Frank Zago <fzago@systemfabricworks.com>
Signed-off-by: Chien Tung <chien.tin.tung@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
On error, set bad_wr in nes_post_send(). Stop processing ib_wr queue
when an error is detected.
Signed-off-by: Frank Zago <fzago@systemfabricworks.com>
Signed-off-by: Chien Tung <chien.tin.tung@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The ibmebus_free_irq() function, which might sleep, was called with
interrupts disabled. To fix this, make sure that no interrupts are
running by killing the interrupt tasklet. Also lock the
shca_list_lock to protect against the poll_eqs_timer running
concurrently.
Signed-off-by: Alexander Schmidt <alexs@linux.vnet.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Use bitmap_weight() instead of finding all set bits in bitmap by hand.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Ralph Campbell <infinipath@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
IPoIB can miss a change in destination GID under some conditions. The
problem is caused when ipoib_neigh->dgid contains a stale address.
The fix is to set ipoib_neigh->dgid to zero in ipoib_neigh_alloc().
This can happen when a system using bonding on its IPoIB interfaces
has switched its active interface from interface A to B and back to A.
The system that fails over will not correctly process the 2nd
address change, as described below.
When an address has changed neighbor->ha is updated with the new
address. Each neighbor has an associated ipoib_neigh.
ipoib_neigh->dgid also holds a copy of the remote node's hardware
address. When an address changes neighbor->ha is updated by the
network layer (arp code) with the new address. IPoIB detects this
change in ipoib_start_xmit() by comparing neighbor->ha with
ipoib_neigh->dgid. The bug is that ipoib_neigh->dgid may already
contain the new address (A) thus the change from B to A is missed by
ipoib. Here is the sequence of events:
ipoib_neigh->dgid = A and neighbor->ha = A
The address is switched to B (the first switch)
neighbor->ha = B
The change is seen in ipoib_start_xmit() -- neighbor->ha !=
ipoib_neigh->dgid so ipoib_neigh is released, and a new one is
allocated.
The allocator may return the same chunk of memory that was just
released, therefore ipoib_neigh->dgid still contains A at this point.
ipoib_neigh->dgid should be updated in neigh_add_path(), but if the
following conditions are true dgid is not updated:
1) __path_find() returns a path
2) path->ah is NULL
The remote system now switches from address B to A, neighbor->ha is
updated to A.
Now we have again : ipoib_neigh->dgid = A and neighbor->ha = A
Since the addresses are the same ipoib won't process the change in
address. Fix this by zeroing out the dgid field when allocating a new
struct ipoib_neigh.
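The sequence above can be replayed in a small userspace sketch. It is not the IPoIB code: neigh_entry, neigh_alloc and addr_changed are hypothetical names, and a single static slot imitates an allocator that hands back the chunk that was just released.

```c
#include <assert.h>
#include <string.h>

#define GID_LEN 16

struct neigh_entry {
    unsigned char dgid[GID_LEN];
};

/* One reusable slot imitates the allocator returning the same memory. */
static struct neigh_entry slot;

/* With zero_dgid set, the stale address is cleared on allocation (the fix). */
static struct neigh_entry *neigh_alloc(int zero_dgid)
{
    if (zero_dgid)
        memset(slot.dgid, 0, GID_LEN);
    return &slot;
}

/* Mirrors the transmit-path test: has the hardware address changed? */
static int addr_changed(const struct neigh_entry *n, const unsigned char *ha)
{
    assert(n && ha);
    return memcmp(n->dgid, ha, GID_LEN) != 0;
}
```

Without zeroing, a reused entry still holds address A, so the later switch from B back to A compares equal and goes unnoticed; with zeroing, the comparison fails and the change is processed.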
Signed-off-by: David Wilder <dwilder@us.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* 'bkl-drivers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
agp: Remove the BKL from agp_open
inifiband: Remove BKL from ipath_open()
mips: Remove BKL from tb0219
drivers: Remove BKL from scx200_gpio
drivers: Remove BKL from pc8736x_gpio
parisc: Remove BKL from eisa_eeprom
rtc: Remove BKL from efirtc
input: Remove BKL from hp_sdc_rtc
hw_random: Remove BKL from core
macintosh: Remove BKL from ans-lcd
nvram: Drop the bkl from non-generic nvram_llseek()
nvram: Drop the bkl from nvram_llseek()
mem_class: Drop the bkl from memory_open()
spi: Remove BKL from spidev_open
drivers: Remove BKL from cs5535_gpio
drivers: Remove BKL from misc_open
When iser enabled lu reset support it did not set the
bit to allow userspace to get/set the timeout. This
sets the tgt and lu reset timeout bits.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <James.Bottomley@suse.de>
fix some typos and punctuation in comments
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Include link scope as part of address resolution. Combine local
and remote address resolution into a single, simpler code path.
Fix error checking in the IPv6 routing lookups.
Based on work from:
David Wilder <dwilder@us.ibm.com>
Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
[ Fix up cma_check_linklocal() for !IPV6 case. - Roland ]
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Merge local and remote address resolution into a single
data flow to ensure consistent access and use of the local routing
tables.
Based on work from:
David Wilder <dwilder@us.ibm.com>
Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The RDMA CM is intended to support the use of a loopback address
when establishing a connection; however, the behavior of the CM
when loopback addresses are used is confusing and does not always
work, depending on whether loopback was specified by the server,
the client, or both.
The defined behavior of rdma_bind_addr is to associate an RDMA
device with an rdma_cm_id, as long as the user specified a non-zero
address (i.e. they weren't just trying to reserve a port).
Currently, if the loopback address is passed to rdma_bind_addr,
no device is associated with the rdma_cm_id. Fix this.
If a loopback address is specified by the client as the destination
address for a connection, it will fail to establish a connection.
This is true even if the server is listening across all addresses or
on the loopback address itself. The issue is that the server tries
to translate the IP address carried in the REQ message to a local
net_device address, which fails. The translation is not needed in
this case, since the REQ carries the actual HW address that should
be used.
Finally, cleanup loopback support to be more transport neutral.
Replace separate calls to get/set the sgid and dgid from the
device address to a single call that behaves correctly depending
on the format of the device address. And support both IPv4 and
IPv6 address formats.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
[ Fixed RDS build by s/ib_addr_get/rdma_addr_get/ - Roland ]
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The struct rdma_dev_addr stores net_device address information:
the source device address, destination hardware address, and
broadcast address. For consistency, store the net_device type
rather than converting it to the rdma_node_type.
The type indicates the format of the various hardware addresses,
which is what we're concerned with, and not the RDMA node type
that the address may map to.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If a source address is provided, verify that the address family matches
that of the destination address. If the source is not specified, use the
same address family as the destination.
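A minimal sketch of that rule, using the POSIX sockaddr types in userspace. The function name resolve_family is hypothetical, treating a NULL or AF_UNSPEC source as "not specified" is an assumption, and so is -EINVAL as the mismatch error.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

/*
 * Stores the address family to use in *family and returns 0,
 * or returns -EINVAL (assumed error code) if src and dst disagree.
 */
static int resolve_family(const struct sockaddr *src,
                          const struct sockaddr *dst,
                          sa_family_t *family)
{
    assert(dst && family);

    /* A provided source must match the destination's family. */
    if (src && src->sa_family != AF_UNSPEC &&
        src->sa_family != dst->sa_family)
        return -EINVAL;

    /* No source given: inherit the family from the destination. */
    *family = dst->sa_family;
    return 0;
}
```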
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Provide the device interface when resolving route information to
ensure that the correct outbound device is used. This will also
simplify processing of sin6_scope_id for IPv6 support.
Based on work from:
David Wilder <dwilder@us.ibm.com>
Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If joining to an AF_INET6 address, we need to map the address to a MGID
in the same way as the IP stack. The old code would just fall through to
the IPv4 case and generate garbage.
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
RDMA CM treats AF_INET6 addresses that are either 0 or prefixed with
FF1x:A01B::/32 as MGIDs, but the detection for the prefix was buggy;
fix it up.
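One way to express the prefix test on the raw 16 address bytes is sketched below. The function names are hypothetical and this is not the RDMA CM code; the "x" nibble in FF1x is simply ignored by masking.

```c
#include <assert.h>
#include <string.h>

/* True if all 16 address bytes are zero. */
static int is_zero_addr(const unsigned char a[16])
{
    static const unsigned char zero[16];

    return memcmp(a, zero, sizeof(zero)) == 0;
}

/* True if the first 32 bits match FF1x:A01B, for any value of x. */
static int has_mgid_prefix(const unsigned char a[16])
{
    return a[0] == 0xff &&
           (a[1] & 0xf0) == 0x10 &&  /* low nibble (x) is not compared */
           a[2] == 0xa0 &&
           a[3] == 0x1b;
}

/* Addresses that are zero or carry the prefix are treated as MGIDs. */
static int treated_as_mgid(const unsigned char a[16])
{
    assert(a);
    return is_zero_addr(a) || has_mgid_prefix(a);
}
```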
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
for_each_netdev() should be used with RTNL or dev_base_lock held,
or else we risk a crash.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Export rdma_set_ib_paths to user space to allow applications to
manually set the IB path used for connections. This allows
alternative ways for a user space application or library to obtain
path record information, including retrieving path information
from cached data, avoiding direct interaction with the IB SA.
The IB SA is a single, centralized entity that can limit scaling
on large clusters running MPI applications.
Future changes to the rdma cm can expand on this framework to
support the full range of features allowed by the IB CM, such as
separate forward and reverse paths and APM.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-By: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
After dma-mapping an SG list provided by the SCSI midlayer, iser has
to make sure the mapped SG is "aligned for RDMA" in the sense that it's
possible to produce one mapping in the HCA IOMMU which represents the
whole SG. Next, the mapped SG is formatted for registration with the HCA.
This patch rewrites the logic that does the above, to make it clearer
and simpler. It also fixes a bug in the "aligned for RDMA" checks,
where only an "end" check was done and the "start" check was missing.
Signed-off-by: Alexander Nezhinsky <alexandern@voltaire.com>
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Current code has a limitation: an LSO header is not allowed to cross a
64 byte boundary. This patch removes this limitation by setting the
WQE RR for large headers thus allowing LSO headers of any size. The
extra buffer reserved for MLX4_IB_QP_LSO QPs has been doubled, from 64
to 128 bytes, assuming this is a reasonable upper limit for header
length. Also, this patch will cause IB_DEVICE_UD_TSO to be set only
for HCA FW versions that set MLX4_DEV_CAP_FLAG_BLH; e.g. FW version
2.6.000 and higher.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
There is no such flag DE - the field is reserved and should be zero.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch was generated by
git grep -E -i -l '[Aa]quire' | xargs -r perl -p -i -e 's/([Aa])quire/$1cquire/'
and the cumsumed was found by checking the diff for aquire.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
cycle_kernel_lock() got pushed down to ipath_open(). I tried hard to
understand what it might protect, but finally gave up.
Roland noted that qlogic seems to have abandoned the ipath driver and
came to the following wise conclusion: "So I guess if the BKL stuff is
blocking you in any way, we can just drop it from ipath and leave it
as yet another race condition in a rotting old driver."
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <adad44tj090.fsf@cisco.com>
Cc: Roland Dreier <rdreier@cisco.com>
After m68k's task_thread_info() doesn't refer to current,
it's possible to remove sched.h from interrupt.h and not break m68k!
Many thanks to Heiko Carstens for allowing this.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6: (34 commits)
[SCSI] qla2xxx: Fix NULL ptr deref bug in fail path during queue create
[SCSI] st: fix possible memory use after free after MTSETBLK ioctl
[SCSI] be2iscsi: Moving to pci_pools v3
[SCSI] libiscsi: iscsi_session_setup to allow for private space
[SCSI] be2iscsi: add 10Gbps iSCSI - BladeEngine 2 driver
[SCSI] zfcp: Fix hang when offlining device with offline chpid
[SCSI] zfcp: Fix lockdep warning when offlining device with offline chpid
[SCSI] zfcp: Fix oops during shutdown of offline device
[SCSI] zfcp: Fix initial device and cfdc for delayed adapter allocation
[SCSI] zfcp: correctly initialize unchained requests
[SCSI] mpt2sas: Bump version 02.100.03.00
[SCSI] mpt2sas: Support dev remove when phy status is MPI2_EVENT_SAS_TOPO_PHYSTATUS_VACANT
[SCSI] mpt2sas: Timeout occurred within the HANDSHAKE logic while waiting on firmware to ACK.
[SCSI] mpt2sas: Call init_completion on a per request basis.
[SCSI] mpt2sas: Target Reset will be issued from Interrupt context.
[SCSI] mpt2sas: Added SCSIIO, Internal and high priority memory pools to support multiple TM
[SCSI] mpt2sas: Copyright change to 2009.
[SCSI] mpt2sas: Added mpi2_history.txt for MPI2 headers.
[SCSI] mpt2sas: Update driver to MPI2 REV K headers.
[SCSI] bfa: Brocade BFA FC SCSI driver
...
This patch allows a local IPv6 address to be resolved by rdma_cm.
To reproduce the problem:
$ rping -s -v -a ::0 &
$ rping -c -v -a <IPv6 address local to this system>
rdma_resolve_addr error -1
Local IPv6 address was obtained with "ip addr show ib0"
Addresses: https://bugs.openfabrics.org/show_bug.cgi?id=1759
Signed-off-by: David Wilder <dwilder@us.ibm.com>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
in_dev_get() can return NULL. If it does, iwch_query_port() will crash.
Handle the NULL case by mapping it to port state INIT.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
In commit cb58160e ("RDMA/iwcm: Reject the connection when the cm_id
is destroyed") a call to the provider's reject handler was added to
destroy_cm_id() to fix a provider endpoint leak. This call needs to
be done with interrupts enabled. So unlock and relock around this
call. This is safe because:
1) the provider will do nothing with this endpoint until the iwcm either
accepts or rejects.
2) the lock is only released after the iwcm state is changed, so an
errant iwcm app that is destroying -and- rejecting the connection
concurrently will get a failure on one of the calls.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
These string query operations were supposed to be replaced by the
generic get_sset_count() starting in 2007. Convert the remaining
implementations.
Also remove calls to these operations to initialise drvinfo->n_stats.
The ethtool core code already does that.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use this_cpu_ptr and __this_cpu_ptr in locations where straight
transformations are possible because per_cpu_ptr is used with
either smp_processor_id() or raw_smp_processor_id().
cc: David Howells <dhowells@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
cc: Ingo Molnar <mingo@elte.hu>
cc: Rusty Russell <rusty@rustcorp.com.au>
cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch contains changes that allow iscsi_session_setup
to allocate private space for LLD's
Signed-off-by: Jayamohan Kallickal <jayamohank@serverengines.com>
Acked-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <James.Bottomley@suse.de>
* mark struct vm_area_struct::vm_ops as const
* mark vm_ops in AGP code
But leave TTM code alone, something is fishy there with global vm_ops
being used.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
IPoIB: Don't turn on carrier for a non-active port
IB/mthca: Fix access to freed memory in catastrophic event handling
mlx4_core: Pass cache line size to device FW
RDMA/nes: Remove duplicate .ndo_set_mac_address field initialization
IB/mad: Fix lock-lock-timer deadlock in RMPP code
Multicast joins can succeed even if the IB port is down. This happens
when the SM runs on the same port as the requesting port. However,
IPoIB calls netif_carrier_on() when the join of the broadcast group
succeeds, without caring about the state of the IB port. The result
is an IPoIB interface in RUNNING state but without an active IB port
to support it.
If a bonding interface uses this IPoIB interface as a slave it might
not detect that this slave is almost useless and failover
functionality will be damaged. The fix checks the state of the IB
port in the carrier_task before calling netif_carrier_on().
Addresses: https://bugs.openfabrics.org/show_bug.cgi?id=1726
Signed-off-by: Moni Shoua <monis@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
catas_reset() uses a pointer to mthca_dev, but mthca_dev is not valid
after the call to __mthca_restart_one().
Based on a similar patch for mlx4 (634354d7, "mlx4: Fix access to
freed memory") by Vitaliy Gusev <vgusev@openvz.org>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The definition of nes_netdev_ops has initializations of a local function
and eth_mac_addr for its ndo_set_mac_address field. This change uses only
the local function.
The semantic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@r@
identifier I, s, fld;
position p0,p;
expression E;
@@
struct I s =@p0 { ... .fld@p = E, ...};
@s@
identifier I, s, r.fld;
position r.p0,p;
expression E;
@@
struct I s =@p0 { ... .fld@p = E, ...};
@script:python@
p0 << r.p0;
fld << r.fld;
ps << s.p;
pr << r.p;
@@
if int(ps[0].line)!=int(pr[0].line) or int(ps[0].column)!=int(pr[0].column):
cocci.print_main(fld,p0)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Holding agent->lock across cancel_delayed_work() (which does
del_timer_sync()) in ib_cancel_rmpp_recvs() leads to lockdep reports of
possible lock-timer deadlocks if a consumer ever does something that
connects agent->lock to a lock taken in IRQ context (cf
http://marc.info/?l=linux-rdma&m=125243699026045).
Fix this by changing the list items to a new state "CANCELING" while
holding the lock, and then canceling the delayed work without holding
the lock. If the delayed work runs after the lock is dropped, it will
see the state is CANCELING and return immediately, so the list will
stay stable while we traverse it with the lock not held.
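The two-step pattern described above can be sketched as follows. This is a simplified single-threaded model: the locking calls are omitted and shown only as comments, and all type and function names are illustrative, not the ones used in the MAD code.

```c
#include <stddef.h>

enum rmpp_state { RMPP_ACTIVE, RMPP_CANCELING };   /* illustrative names */

struct rmpp_recv {
	enum rmpp_state state;
	int timeouts_handled;
	struct rmpp_recv *next;
};

/* Step 1, done with the lock held: mark every entry so that late-running
 * work knows it must not touch the list. */
void mark_canceling(struct rmpp_recv *head)
{
	struct rmpp_recv *r;

	for (r = head; r != NULL; r = r->next)
		r->state = RMPP_CANCELING;
}

/* Body of the delayed work. Returns 0 without doing anything when the
 * entry is being canceled, 1 after handling a timeout otherwise. Step 2
 * (canceling the work with the lock dropped) is safe because of this
 * early return. */
int recv_timeout_work(struct rmpp_recv *r)
{
	if (r->state == RMPP_CANCELING)
		return 0;
	r->timeouts_handled++;
	return 1;
}
```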
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Originally, walk_memory_resource() was introduced to traverse all memory
of "System RAM" for detecting memory hotplug/unplug range. For doing so,
flags of IORESOURCE_MEM|IORESOURCE_BUSY were used and this was enough for
memory hotplug.
But for other purposes, such as /proc/kcore, this may include some firmware
areas marked as IORESOURCE_BUSY | IORESOURCE_MEM. This patch makes the
check strict to find only busy "System RAM".
Note: PPC64 keeps its own walk_memory_resource(), which walks through
ppc64's lmb information. Because the old kclist_add() is called per lmb, this
patch makes no difference in behavior there.
This patch also removes the CONFIG_MEMORY_HOTPLUG check from this function.
Because pfn_valid() just shows "there is memmap or not" and cannot be used
for "there is physical memory or not", this function is generally useful
for scanning physical memory ranges.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Américo Wang <xiyou.wangcong@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let attribute group vectors be declared "const". We'd
like to let most attribute metadata live in read-only
sections... this is a start.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1623 commits)
netxen: update copyright
netxen: fix tx timeout recovery
netxen: fix file firmware leak
netxen: improve pci memory access
netxen: change firmware write size
tg3: Fix return ring size breakage
netxen: build fix for INET=n
cdc-phonet: autoconfigure Phonet address
Phonet: back-end for autoconfigured addresses
Phonet: fix netlink address dump error handling
ipv6: Add IFA_F_DADFAILED flag
net: Add DEVTYPE support for Ethernet based devices
mv643xx_eth.c: remove unused txq_set_wrr()
ucc_geth: Fix hangs after switching from full to half duplex
ucc_geth: Rearrange some code to avoid forward declarations
phy/marvell: Make non-aneg speed/duplex forcing work for 88E1111 PHYs
drivers/net/phy: introduce missing kfree
drivers/net/wan: introduce missing kfree
net: force bridge module(s) to be GPL
Subject: [PATCH] appletalk: Fix skb leak when ipddp interface is not loaded
...
Fixed up trivial conflicts:
- arch/x86/include/asm/socket.h
converted to <asm-generic/socket.h> in the x86 tree. The generic
header has the same new #define's, so that works out fine.
- drivers/net/tun.c
fix conflict between 89f56d1e9 ("tun: reuse struct sock fields") that
switched over to using 'tun->socket.sk' instead of the redundantly
available (and thus removed) 'tun->sk', and 2b980dbd ("lsm: Add hooks
to the TUN driver") which added a new 'tun->sk' use.
Noted in 'next' by Stephen Rothwell.
If the cm_id of a connect request is destroyed prior to the ULP
accepting or rejecting the connection, then the provider never cleans
up the connection. The iwcm should explicitly reject these
connections if the cm_id is destroyed.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
FW mismatches can cause a crash in the iw_cxgb3 event handler.
- NULL the t3cdev->ulp pointer on failures in cxio_rdev_open()
- Silently ignore events when the ulp ptr is NULL in iwch_err_handler()
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
MADs are UD and can be dropped if there are no receives posted, so
allow receive queue size to be set with a module parameter in case the
queue needs to be lengthened. Send side tuning is done for symmetry
with receive.
Signed-off-by: Hal Rosenstock <hal.rosenstock@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Lockdep reported a possible deadlock with cm_id_priv->lock,
mad_agent_priv->lock and mad_agent_priv->timed_work.timer; this
happens because the mad module does
cancel_delayed_work(&mad_agent_priv->timed_work);
while holding mad_agent_priv->lock. cancel_delayed_work() internally
does del_timer_sync(&mad_agent_priv->timed_work.timer).
This can turn into a deadlock because mad_agent_priv->lock is taken
inside cm_id_priv->lock, so we can get the following set of contexts
that deadlock each other:
A: holding cm_id_priv->lock, waiting for mad_agent_priv->lock
B: holding mad_agent_priv->lock, waiting for del_timer_sync()
C: interrupt during mad_agent_priv->timed_work.timer that takes
cm_id_priv->lock
Fix this by using the new __cancel_delayed_work() interface (which
internally does del_timer() instead of del_timer_sync()) in all the
places where we are holding a lock.
Addresses: http://bugzilla.kernel.org/show_bug.cgi?id=13757
Reported-by: Bart Van Assche <bart.vanassche@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Old query_port code reports static MTU and link state values.
Instead, map actual MTU to next largest IB_MTU_* constant and
correctly report link state.
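A minimal sketch of such a mapping, assuming the intent is to report the largest IB MTU that does not exceed the netdev MTU (with IB_MTU_256 as the floor). The enum values mirror the kernel's enum ib_mtu; the function name is hypothetical.

```c
/* Values mirror the kernel's enum ib_mtu. */
enum ib_mtu {
	IB_MTU_256  = 1,
	IB_MTU_512  = 2,
	IB_MTU_1024 = 3,
	IB_MTU_2048 = 4,
	IB_MTU_4096 = 5
};

/* Hypothetical helper: pick the largest IB MTU that fits in mtu bytes. */
enum ib_mtu netdev_mtu_to_ib_mtu(unsigned int mtu)
{
	if (mtu >= 4096)
		return IB_MTU_4096;
	if (mtu >= 2048)
		return IB_MTU_2048;
	if (mtu >= 1024)
		return IB_MTU_1024;
	if (mtu >= 512)
		return IB_MTU_512;
	return IB_MTU_256;
}
```

With a standard 1500-byte Ethernet MTU this reports IB_MTU_1024; a 9000-byte jumbo MTU reports IB_MTU_4096.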
Cc: Steve Wise <swise@opengridcomputing.com>
Reported-by: Jeff Squyres <jsquyres@cisco.com>
Signed-off-by: Chien Tung <chien.tin.tung@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The disconn routine has been reworked to accommodate the terminate and
flushing changes. The routine has been reorganized to make all the
decisions at the start and then perform all the required operations.
This simplifies the lock handling and is easier to follow.
Signed-off-by: Don Wood <donald.e.wood@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Use the flush status to fill in cqe status when a specific error has
been identified. Subsequent flushed completions still use the flushed
value.
Signed-off-by: Don Wood <donald.e.wood@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When a flush request is given to the hw, it will place one cqe marked
as flushed (unless there is nothing to flush). An application that is
waiting for all wqe's to complete will be left hanging. This modifies
poll_cq to return the correct number of flushes for the pending
elements on the wq.
Signed-off-by: Don Wood <donald.e.wood@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When an asynchronous event occurs that requires a terminate, it is
sometimes possible to identify the wqe in error. This change uses
flush to get this information to the poll routine. The flush
operation puts the status into the cqe. If this information is not
available, it continues to use the more generic flush code as before.
Signed-off-by: Don Wood <donald.e.wood@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Implement the sending and receiving of Terminate packets.
Signed-off-by: Don Wood <donald.e.wood@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
CQ errors are not being handled correctly. Put in the upcall for
CQ errors.
Signed-off-by: Don Wood <donald.e.wood@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When a QP is destroyed, unprocessed CQ entries could still reference
the QP. This change zeroes the context value at QP destroy time. By
skipping over cqe's with a zero context, poll_cq no longer processes a
cqe for a destroyed QP.
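The skip logic can be pictured roughly as below. The cqe layout and the function name poll_cq_sketch() are made up for illustration; the real nes cqe format and poll routine differ.

```c
#include <stdint.h>

struct cqe {
	uint64_t qp_context;  /* 0 means the owning QP was destroyed */
	uint32_t status;
};

/* Copy at most max live entries from ring to out, skipping entries whose
 * context has been zeroed. Returns the number of entries delivered. */
int poll_cq_sketch(const struct cqe *ring, int ring_len,
		   struct cqe *out, int max)
{
	int i, n = 0;

	for (i = 0; i < ring_len && n < max; i++) {
		if (ring[i].qp_context == 0)
			continue;  /* stale cqe for a destroyed QP */
		out[n++] = ring[i];
	}
	return n;
}
```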
Signed-off-by: Don Wood <donald.e.wood@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The routine to allocate a cqp request is not called from process
context code. Since it is not OK to sleep, it needs to use GFP_ATOMIC
not GFP_KERNEL.
Signed-off-by: Don Wood <donald.e.wood@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The code currently has a work structure in the QP. This requires a
lock and a pending flag to ensure there is never more than one request
active. When two events happen quickly (such as FIN and LLP CLOSE),
it causes unnecessary timeouts since the second one is dropped.
This fix allocates memory for the work request so the second one can
be queued. A lock is removed since it is no longer needed.
Signed-off-by: Don Wood <donald.e.wood@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
During termination, it is possible for the refcnt to go to zero while
the worker thread is posting events upward. This fix increments the
refcnt before the request is passed to the worker thread. The thread
decrements the refcnt when the request is completed.
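The reference handoff can be sketched like this. It is a single-threaded model with hypothetical names; a plain int stands in for what would need to be an atomic counter in real driver code.

```c
struct qp {
	int refcnt;   /* plain int for brevity; real code needs an atomic */
	int freed;
};

void qp_get(struct qp *qp)
{
	qp->refcnt++;
}

void qp_put(struct qp *qp)
{
	if (--qp->refcnt == 0)
		qp->freed = 1;   /* stands in for the real teardown */
}

/* Take the reference before the request leaves the caller's hands. */
void queue_disconnect(struct qp *qp)
{
	qp_get(qp);
	/* ... hand the request to the worker thread here ... */
}

/* Runs in the worker: post events upward, then drop the reference. */
void disconnect_worker(struct qp *qp)
{
	/* ... post events to the upper layer here ... */
	qp_put(qp);
}
```

Even if the owner drops its own reference while the request is queued, the QP stays alive until disconnect_worker() has finished.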
Signed-off-by: Don Wood <donald.e.wood@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Userspace apps are supposed to release all ib device resources if they
receive a fatal async event (IBV_EVENT_DEVICE_FATAL). However, the
app has no way of knowing when the device has come back up, except to
repeatedly attempt ibv_open_device() until it succeeds.
However, currently there is no protection against the open succeeding
while the device is being removed following the fatal event. In
this case, the open will succeed, but as a result the device waits in
the middle of its removal until the new app releases its resources --
and the new app will not do so, since the open succeeded at a point
following the fatal event generation.
This patch adds an "active" flag to the device. The active flag is set
to false (in the fatal event flow) before the "fatal" event is
generated, so any subsequent ibv_open_device() call to the device will
fail until the device comes back up, thus preventing the above
deadlock.
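The ordering that matters here (clear the flag first, then generate the event) can be modeled as below. The struct, the function names, and the choice of -EAGAIN as the error code are assumptions for illustration only.

```c
#include <errno.h>
#include <stdbool.h>

struct ib_dev_sketch {
	bool active;
	int fatal_events_sent;
};

/* Fatal event flow: mark the device inactive before anyone can see
 * the event. */
void report_fatal(struct ib_dev_sketch *dev)
{
	dev->active = false;        /* must happen before the event is visible */
	dev->fatal_events_sent++;   /* stands in for dispatching the event */
}

/* Open path: refuse new users while the device is going away. */
int open_device(struct ib_dev_sketch *dev)
{
	return dev->active ? 0 : -EAGAIN;
}
```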
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The mthca driver uses the same name for interrupts for every
device in the system. This can make it very confusing trying to work
out exactly which device MSI-X interrupts are for. Change the driver
to add the PCI name of the device to the interrupt name.
Signed-off-by: Arputham Benjamin <abenjamin@sgi.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
mthca_ib_lock_cqs()/mthca_ib_unlock_cqs() are helper functions that
lock/unlock both CQs attached to a QP in the proper order to avoid
AB-BA deadlocks. Annotate this so sparse can understand what's going
on (and warn us if we misuse these functions).
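For readers unfamiliar with the pattern, here is a userspace sketch of ordered double locking, assuming the order is decided by CQ number. An int flag stands in for the spinlock, and the sparse annotations themselves are not reproduced.

```c
#include <assert.h>

struct cq {
	unsigned int cqn;
	int locked;   /* stand-in for a real spinlock */
};

void cq_lock(struct cq *cq)
{
	assert(!cq->locked);
	cq->locked = 1;
}

void cq_unlock(struct cq *cq)
{
	assert(cq->locked);
	cq->locked = 0;
}

/* Lock both CQs, lowest CQ number first; a shared CQ is locked once.
 * Returns the CQ that was locked first. */
struct cq *lock_cqs(struct cq *send_cq, struct cq *recv_cq)
{
	struct cq *first = send_cq, *second = recv_cq;

	if (send_cq == recv_cq) {
		cq_lock(send_cq);
		return send_cq;
	}
	if (recv_cq->cqn < send_cq->cqn) {
		first = recv_cq;
		second = send_cq;
	}
	cq_lock(first);
	cq_lock(second);
	return first;
}

/* Release in the reverse of the acquisition order. */
void unlock_cqs(struct cq *send_cq, struct cq *recv_cq)
{
	if (send_cq == recv_cq) {
		cq_unlock(send_cq);
	} else if (send_cq->cqn < recv_cq->cqn) {
		cq_unlock(recv_cq);
		cq_unlock(send_cq);
	} else {
		cq_unlock(send_cq);
		cq_unlock(recv_cq);
	}
}
```

Because every caller takes the lower-numbered CQ first, two QPs that share the same pair of CQs in opposite roles cannot deadlock against each other.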
Signed-off-by: Roland Dreier <rolandd@cisco.com>
mthca_config_reg.h was including <asm/page.h> for no reason -- the whole
file is just defines of constants, so it's entirely self-contained.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Userspace apps are supposed to release all ib device resources if they
receive a fatal async event (IBV_EVENT_DEVICE_FATAL). However, the
app has no way of knowing when the device has come back up, except to
repeatedly attempt ibv_open_device() until it succeeds.
However, currently there is no protection against the open succeeding
while the device is being removed following the fatal event. In
this case, the open will succeed, but as a result the device waits in
the middle of its removal until the new app releases its resources --
and the new app will not do so, since the open succeeded at a point
following the fatal event generation.
This patch adds an "active" flag to the device. The active flag is set
to false (in the fatal event flow) before the "fatal" event is
generated, so any subsequent ibv_open_device() call to the device will
fail until the device comes back up, thus preventing the above
deadlock.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
mlx4_ib_lock_cqs()/mlx4_ib_unlock_cqs() are helper functions that
lock/unlock both CQs attached to a QP in the proper order to avoid
AB-BA deadlocks. Annotate this so sparse can understand what's going
on (and warn us if we misuse these functions).
Signed-off-by: Roland Dreier <rolandd@cisco.com>
dev->ibdev.iwcm allocation may fail, prevent a dereference.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Since the original commit 883a99c7 ("[IB] uverbs: Add a mask of device
methods allowed for userspace"), the uverbs core returns EINVAL for
commands not implemented by a specific low-level driver.
This creates a problem that there is no way to tell the difference
between an unimplemented command and an implemented one which is
incorrectly invoked (which also returns EINVAL).
The fix is to have unimplemented commands return ENOSYS.
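The distinction can be sketched with a small dispatcher. The table size, handler type and names below are hypothetical; only the idea of checking a per-device command mask and returning -ENOSYS for a masked-out command comes from this message.

```c
#include <errno.h>
#include <stdint.h>

#define NUM_CMDS 4u

typedef int (*cmd_handler)(int arg);

/* Example handler that rejects a bad argument with -EINVAL. */
int cmd_example(int arg)
{
	return arg < 0 ? -EINVAL : 0;
}

int dispatch_cmd(uint64_t cmd_mask, const cmd_handler *table,
		 unsigned int cmd, int arg)
{
	if (cmd >= NUM_CMDS)
		return -EINVAL;      /* not a known command number */
	if (!(cmd_mask & (1ULL << cmd)) || !table[cmd])
		return -ENOSYS;      /* known, but not implemented here */
	return table[cmd](arg);      /* may itself return -EINVAL */
}
```

A caller that sees -ENOSYS knows the driver lacks the command, while -EINVAL now only means the invocation was wrong.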
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Until now, retries were only sent when joining a multicast group. This
patch adds retries when leaving a multicast group as well.
Signed-off-by: Ron Livne <ronli@voltaire.com>
Signed-off-by: Yossi Etigin <yosefe@voltaire.com>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Replace open-coded reimplementations with printk_once().
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Use the %pM conversion specifier to print a MAC address.
Signed-off-by: Tobias Klauser <klto@zhaw.ch>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Rather than just defining static spinlock_t variables and then
initializing them later in init functions, simply define them with
DEFINE_SPINLOCK() and remove the calls to spin_lock_init(). This cleans
up the source a tad and also shrinks the compiled code; eg on x86-64:
add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-40 (-40)
function old new delta
ib_uverbs_init 336 326 -10
ib_mad_init_module 147 137 -10
ib_sa_init 123 103 -20
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The hop count field in a directed route MAD is only allowed to be in the
range 0 to 63 (by spec). Check that this really is the case to avoid
accessing outside the bounds of the hop array.
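The bounds check amounts to something like the following sketch. The constant and function names are illustrative; the limit of 64 path entries follows from the 0 to 63 range stated above.

```c
#include <stdint.h>

#define DR_MAX_PATH_HOPS 64   /* hop counts 0..63 are legal */

/* Returns 1 if hop_cnt can safely be used to index a path array of
 * DR_MAX_PATH_HOPS entries, 0 otherwise. */
int dr_hop_cnt_valid(uint8_t hop_cnt)
{
	return hop_cnt < DR_MAX_PATH_HOPS;
}
```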
Reported-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Check that the format of multicast link addresses is correct before
taking them from dev->mc_list to priv->multicast_list. This way we
never try to send a bogus address to the SA, which prevents badness
from erroneous 'ip maddr addr add', broken bonding drivers, etc.
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
IPoIB currently must use irqsave locking for priv->lock, since it is
taken from interrupt context in one path. However, ipoib_send() does
skb_orphan(), and the network stack locking is not IRQ-safe.
Therefore we need to make sure we don't hold priv->lock when calling
ipoib_send() to avoid lockdep warnings (the code was almost certainly
safe in practice, since the only code path that takes priv->lock from
interrupt context would never call into the network stack).
Addresses: http://bugzilla.kernel.org/show_bug.cgi?id=13757
Reported-by: Bart Van Assche <bart.vanassche@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
strlcpy() will always null terminate the string. node_desc is not
guaranteed to be NUL-terminated so just use memcpy().
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The driver was reporting CQE flags in the wrong bit positions, causing
consumers to miss incoming immediate data.
Signed-off-by: Joachim Fenkes <fenkes@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The old code used a lot of hard-coded values, which might not be valid
in all environments (especially routed fabrics or partitioned
subnets). Copy as much information as possible from the incoming
request to correct that.
Signed-off-by: Joachim Fenkes <fenkes@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Make port autodetect mode the default for the ehca driver. The
autodetect code has been in the kernel for several releases now and
has proved to be stable.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
A close/abort while waiting for a wr_ack during connection migration
can cause a hung process in iwch_accept_cr/iwch_reject_cr.
The fix is to set rpl_error/rpl_done and wake up the waiters when we
get a close/abort while in MPA_REQ_RCVD state.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
- Keep ref on connection request endpoints until either accepted or
rejected so it doesn't get freed early.
- Endpoint flags now need to be set via atomic bitops because they can
be set on both the iw_cxgb3 workqueue thread and user disconnect
threads.
- Don't move out of CLOSING too early due to multiple calls to
iwch_ep_disconnect.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Massage the err_handler upcall into an event handler upcall, pass
netdev port events to the cxgb3 ULPs and generate RDMA port events
based on LLD port events.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The generic packet receive code takes care of setting
netdev->last_rx when necessary, for the sake of the
bonding ARP monitor.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Neil Horman <nhorman@txudriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No need to put ethtool_ops in data, they should be const.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add __init and __exit annotations to the module_init/module_exit
functions from drivers/infiniband/core/addr.c and cma.c.
Signed-off-by: Peter Huewe <peterhuewe@gmx.de>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Increment version number for DMEM toleration.
Signed-off-by: Alexander Schmidt <alexs@linux.vnet.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
dma_sync_single() is deprecated now, and the use in mthca is wrong:
there should be a dma_sync_single_for_cpu() before touching the memory
from the CPU, and a dma_sync_single_for_device() afterwards. Fix
this, prompted by a kick in the pants from a patch from FUJITA
Tomonori <fujita.tomonori@lab.ntt.co.jp>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
During cluster testing, one QP was not closed, as FIN is not handled
properly when its rexmit count expires or in some cases when RST is
received after sending FIN. The reason is that the cm_id does not get
decremented under these conditions.
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
In nes_query_device(), max_qp_init_rd_atom is incorrectly set to
max_qp_wr. This was found when a test application had a dapl async
event error.
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This prevents the memcpy() of a guid_entries element using a negative index.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Implement toleration of dynamic memory operations and 16 GB gigantic
pages, where "toleration" means that the driver can cope with dynamic
memory operations that happen before the driver is loaded. While the
ehca driver is loaded, dynamic memory operations are still prohibited
by returning NOTIFY_BAD from the memory notifier.
On module load the driver walks through available system memory,
checks for available memory ranges and then registers the kernel
internal memory region accordingly. The translation of address ranges
is implemented via a 3-level busmap.
Signed-off-by: Hannes Hering <hering2@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
In the near future, the driver core is going to not allow direct access
to the driver_data pointer in struct device. Instead, the functions
dev_get_drvdata() and dev_set_drvdata() should be used. These functions
have been around since the beginning, so are backwards compatible with
all older kernel versions.
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: general@lists.openfabrics.org
Cc: Christoph Raisch <raisch@de.ibm.com>
Acked-by: Hoang-Nam Nguyen <hnguyen@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
In the near future, the driver core is going to not allow direct access
to the driver_data pointer in struct device. Instead, the functions
dev_get_drvdata() and dev_set_drvdata() should be used. These functions
have been around since the beginning, so are backwards compatible with
all older kernel versions.
Cc: general@lists.openfabrics.org
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
mlx4_core: Don't double-free IRQs when falling back from MSI-X to INTx
IB/mthca: Don't double-free IRQs when falling back from MSI-X to INTx
IB/mlx4: Add strong ordering to local inval and fast reg work requests
IB/ehca: Remove superfluous bitmasks from QP control block
RDMA/cxgb3: Limit fast register size based on T3 limitations
RDMA/cxgb3: Report correct port state and MTU
mlx4_core: Add module parameter for number of MTTs per segment
IB/mthca: Add module parameter for number of MTTs per segment
RDMA/nes: Fix off-by-one bugs in reset_adapter_ne020() and init_serdes()
infiniband: Remove void casts
IB/ehca: Increment version number
IB/ehca: Remove unnecessary memory operations for userspace queue pairs
IB/ehca: Fall back to vmalloc() for big allocations
IB/ehca: Replace vmalloc() with kmalloc() for queue allocation
When both MSI-X and legacy INTx fail to generate an interrupt, the
driver frees the MSI-X interrupts twice. Fix this by clearing the
have_irq flag for the MSI-X interrupts when they are freed the first
time.
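The guard can be modeled as below, with a counter standing in for the actual free_irq() call. The struct and function names are illustrative.

```c
struct eq_sketch {
	int have_irq;     /* set while an IRQ is registered for this EQ */
	int free_calls;   /* counts how often the IRQ was really freed */
};

/* Free the IRQ at most once: clearing have_irq makes a second call
 * (for example from a later cleanup path) a no-op. */
void free_eq_irq(struct eq_sketch *eq)
{
	if (!eq->have_irq)
		return;
	eq->free_calls++;   /* stands in for free_irq() */
	eq->have_irq = 0;
}
```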
Reported-by: Yinghai Lu <yhlu.kernel@gmail.com>
Tested-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The ConnectX Programmer's Reference Manual states that the "SO" bit
must be set when posting Fast Register and Local Invalidate send work
requests. When this bit is set, the work request will be executed
only after all previous work requests on the send queue have been
executed. (If the bit is not set, Fast Register and Local Invalidate
WQEs may begin execution too early, which violates the defined
semantics for these operations)
This fixes the issue with NFS/RDMA reported in
<http://lists.openfabrics.org/pipermail/general/2009-April/059253.html>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Cc: <stable@kernel.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
All the fields in the control block are nicely right-aligned, so no
masking is necessary.
Signed-off-by: Joachim Fenkes <fenkes@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Define three accessors to get/set dst attached to a skb
struct dst_entry *skb_dst(const struct sk_buff *skb)
void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)
void skb_dst_drop(struct sk_buff *skb)
skb_dst_drop() should replace occurrences of:
dst_release(skb->dst)
skb->dst = NULL;
Delete skb->dst field
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
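Illustration for the skb_dst accessors above: a userspace mock showing how the three helpers hide the field. The structures are heavily simplified, the private field name _dst is hypothetical, and dst_release() here only decrements a counter.

```c
#include <stddef.h>

/* Simplified stand-ins; the real kernel structures carry far more state. */
struct dst_entry { int refcnt; };
struct sk_buff   { struct dst_entry *_dst; };  /* hypothetical field name */

/* Mock release: just drop one reference. */
static void dst_release(struct dst_entry *dst)
{
    if (dst)
        dst->refcnt--;
}

static struct dst_entry *skb_dst(const struct sk_buff *skb)
{
    return skb->_dst;
}

static void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)
{
    skb->_dst = dst;
}

/* Replaces the open-coded "dst_release(skb->dst); skb->dst = NULL;" pair. */
static void skb_dst_drop(struct sk_buff *skb)
{
    dst_release(skb->_dst);
    skb->_dst = NULL;
}

/* Attach a dst holding one reference, drop it, and return the final
 * reference count (or -1 if the pointer was not cleared). */
static int demo_drop(void)
{
    struct dst_entry dst = { .refcnt = 1 };
    struct sk_buff skb = { ._dst = NULL };

    skb_dst_set(&skb, &dst);
    skb_dst_drop(&skb);
    return skb_dst(&skb) == NULL ? dst.refcnt : -1;
}
```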
These are the last two drivers that need skb->dst in their start_xmit()
function. Tell dev_hard_start_xmit() not to release it by unsetting
IFF_XMIT_DST_RELEASE.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
T3 firmware only supports one WR's worth of page list for fast register
work requests. The driver currently allows 2 WRs' worth, which
doesn't work for T3, so reduce the limit in the driver.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The current MTT allocator uses kmalloc() to allocate a buffer for its
buddy allocator, and thus is limited in the amount of MTT segments
that it can control. As a result, the size of memory that can be
registered is limited too. This patch uses a module parameter to
control the number of MTT entries that each segment represents,
allowing more memory to be registered with the same number of
segments.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
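Illustration for the MTT-per-segment parameter above: with a fixed number of segments, the registrable memory scales with the number of MTT entries per segment. This sketch is arithmetic only. It assumes each MTT entry maps one page, and the function and parameter names are hypothetical.

```c
#include <stdint.h>

/* Bytes coverable by the MTT table: each segment holds
 * 2^log_mtts_per_seg entries, and each entry is assumed to map one page. */
static uint64_t max_registrable_bytes(uint64_t num_segments,
                                      unsigned int log_mtts_per_seg,
                                      uint64_t page_size)
{
    return (num_segments << log_mtts_per_seg) * page_size;
}
```

Under these assumptions, 2^20 segments with 4 KiB pages cover 4 GiB at one MTT per segment and 32 GiB at eight MTTs per segment.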
If a task did not complete normally due to a TMF, libiscsi will
now complete the task with the state ISCSI_TASK_ABRT_TMF. Drivers
like bnx2i that need to free resources if a command did not complete normally
can then check the task state. If a driver does not need to send
a special command when the session has been dropped, it can check
for ISCSI_TASK_ABRT_SESS_RECOV.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
When we create the tcp/ip connection by calling ep_connect, we currently
just go by the routing table info.
I think there are two problems with this.
1. Some drivers do not have access to a routing table. Some drivers like
qla4xxx do not even know about other ports.
2. If you have two initiator ports on the same subnet, the user may have
set things up so that session1 was supposed to be run through port1 and
session2 was supposed to be run through port2. It looks like we could
end up with both sessions going through one of the ports.
Fixes for cxgb3i from Karen Xie.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Network device sysfs files that grab the rtnl_lock unconditionally
will deadlock if accessed when the network device is being
unregistered. So use trylock and syscall_restart to avoid this
deadlock.
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
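Illustration for the sysfs deadlock fix above: the handler tries the lock and, if it is contended, backs out with a "retry" result instead of blocking. This is a userspace sketch only. The lock is a C11 atomic_flag, and SKETCH_ERESTART and all fake_* names are hypothetical stand-ins for the kernel mechanisms.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_ERESTART 512  /* hypothetical "restart the syscall" code */

static atomic_flag fake_rtnl = ATOMIC_FLAG_INIT;  /* stand-in for the RTNL */

static bool fake_rtnl_trylock(void)
{
    return !atomic_flag_test_and_set(&fake_rtnl);
}

static void fake_rtnl_unlock(void)
{
    atomic_flag_clear(&fake_rtnl);
}

/* Attribute handler: never blocks on the lock. If the lock is contended
 * (for example by an unregister in progress), ask to be retried. */
static int fake_show_attr(int *value)
{
    if (!fake_rtnl_trylock())
        return -SKETCH_ERESTART;
    *value = 1500;  /* pretend to read a device attribute */
    fake_rtnl_unlock();
    return 0;
}

/* Call the handler while the lock is held elsewhere, then again after
 * it has been released. Returns 1 if both calls behaved as expected. */
static int demo_contended_then_free(void)
{
    int value = 0;
    int first, second;

    fake_rtnl_trylock();             /* simulate another path holding it */
    first = fake_show_attr(&value);  /* backs out without blocking */
    fake_rtnl_unlock();
    second = fake_show_attr(&value); /* now succeeds */
    return first == -SKETCH_ERESTART && second == 0 && value == 1500;
}
```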
With a postfix increment, i is incremented one past 10K/5K before the
loop ends, so the error messages will be displayed too soon if the
test succeeds on the last iteration. Fix the comparisons to be >
instead of >=.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
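Illustration for the off-by-one fix above: with a "condition && i++ < limit" loop, a success on the last allowed iteration leaves i equal to the limit, while a real timeout leaves it one past the limit. The sketch below uses a fake device and a small hypothetical limit (MAX_TRIES) in place of the 10K/5K values.

```c
#include <stdbool.h>

#define MAX_TRIES 5  /* stands in for the 10K/5K limits in the commit */

/* Poll a fake device that reports "not ready" for the first ready_after
 * polls. Returns whether the caller would report a timeout, using either
 * the old (>=) or the fixed (>) comparison. */
static bool reports_timeout(int ready_after, bool fixed)
{
    int polls = 0;
    int i = 0;

    /* Left operand: device still not ready. Right operand: retry budget. */
    while (polls++ < ready_after && i++ < MAX_TRIES)
        ;  /* the driver would delay here */

    return fixed ? (i > MAX_TRIES) : (i >= MAX_TRIES);
}
```

When the device becomes ready right after the last permitted retry, the old comparison reports a timeout and the fixed one does not. Both agree when the device never becomes ready.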
The queue map for flush completion circumvention is only used for
kernel space queue pairs. This patch skips the allocation of the
queue maps in case the QP is created for userspace. In addition, this
patch does not iomap the galpas for kernel usage if the queue pair is
only used in userspace. These changes will improve the performance of
creation of userspace queue pairs.
Signed-off-by: Stefan Roscher <stefan.roscher@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
In case of large queue pairs there is the possibility of allocation
failures due to memory fragmentation when using kmalloc(). To ensure
the memory is allocated even if kmalloc() cannot find chunks which
are big enough, we fall back to allocating the memory with vmalloc().
Signed-off-by: Stefan Roscher <stefan.roscher@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
To improve performance of driver resource allocation, replace
vmalloc() calls with kmalloc().
Signed-off-by: Stefan Roscher <stefan.roscher@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
IB/mlx4: Don't overwrite fast registration page list when posting work request
RDMA/cxgb3: Don't complete flushed send work requests twice
The low-level mlx4 driver modified the page-list addresses for fast
register work requests post send to big-endian, and set a "present"
bit. This caused problems later when the consumer attempted to unmap
the pages using the page-list (using the list addresses which were
assumed to be still in CPU-endian order). Fix the mlx4 driver to
allocate two buffers and use a private buffer for the hardware-format
bus addresses.
This patch fixes <https://bugs.openfabrics.org/show_bug.cgi?id=1571>,
an NFS/RDMA server crash. The cause of the crash was found by Vu Pham
of Mellanox. The fix is along the lines suggested by Steve Wise in
comment #21 in bug 1571.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When the SQ is flushed, mark the flushed entries as not signaled so
the poll logic doesn't re-insert the CQ entry thinking it is an
out-of-order completion.
The bug can cause the NFS/RDMA server to crash due to processing the
same completed work request twice.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If reg_phys_mem() fails, we need to free memory allocated for MPA
frame with private data before returning the error. Also move
nes_add_ref() after the reg_phys_mem() is successful.
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
On a large cluster setup, the driver hangs after many hours of
testing. Fixing this required going over the code and making sure the
rexmit entry was properly removed based on the cm_node's state and
packet received. Also when receiving a FIN packet, check seq# and
make sure there were no errors before calling handle_fin().
Following are the changes done in nes_cm.c:
* handle_ack_pkt() needs to return error value, so in case of error,
handle_fin() is not called. Some cleanup done while going over the code.
* handle_rst_pkt(), handling of cm_node's NES_CM_STATE_LAST_ACK is missing.
* process_packet(), in case of FIN only packet is received, call
check_seq() before processing.
* in handle_fin_pkt(), we are calling cleanup_retrans_entry() for all
conditions, even if the packets need to be dropped.
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Under heavy load with large cluster testing, it may take longer to
receive a response to MPA requests. Change the driver to wait longer
after each rexmit, up to a maximum time value.
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
check_seq() was not checking if the seq#s have wrapped. Fix it.
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
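Illustration for the check_seq() fix above: 32-bit sequence numbers are usually compared modulo 2^32 so that the comparison stays valid across wraparound. This is a generic sketch of that technique, not the driver's actual check_seq(). The helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if sequence number a precedes b, modulo 2^32. The unsigned
 * subtraction wraps, and viewing the result as signed gives the direction. */
static bool seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* True if seq lies in the window [start, start + len), wrap included. */
static bool seq_in_window(uint32_t seq, uint32_t start, uint32_t len)
{
    return (uint32_t)(seq - start) < len;
}
```

A plain "a < b" comparison would treat 0xFFFFFFF0 as coming after 0x10, whereas seq_before() treats it as coming before.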
When a connect request comes, apbvt should only be set for
non-loopback connections.
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Remove the NES_DEBUG that is causing the compile warning about an
unused variable when INFINIBAND_NES_DEBUG is not enabled.
Signed-off-by: Chien Tung <chien.tin.tung@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
/sys/class/infiniband/nes?/fw_ver is not displaying firmware version
properly (it shows 0.0.0 with the current code). Fill in the correct
firmware version number.
Signed-off-by: Chien Tung <chien.tin.tung@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
With updated PHY firmware for SFP_D, setting the trace length to 1
inch for SFP_D provides a more stable link.
Signed-off-by: Chien Tung <chien.tin.tung@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Enable repause timer for port 1. Without this setting, under stress,
the chip may misbehave.
Signed-off-by: Chien Tung <chien.tin.tung@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
In commit 1b949324 ("RDMA/nes: Fix SFP+ PHY initialization") there is
a mistake in the clean up code that removed port 1 CDR loop filter
settings for 10G cards other than CX4. Put the correct setting back
for appropriate PHY types.
Signed-off-by: Chien Tung <chien.tin.tung@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Change thermo mitigation code to flip the SerDes1 reference clock to
internal, to match the change in commit a4849fc1 ("RDMA/nes: Add
wide_ppm_offset parm for switch compatibility").
Signed-off-by: Chien Tung <chien.tin.tung@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Set target can queue limit to the number of preallocated
session tasks we have.
This along with the cxgb3i can_queue patch will fix a throughput
problem where it could only queue one LU worth of data at a time.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
In error paths where a CQ is not created, pbl is not freed properly.
In nes_destroy_cq(), add the corresponding check for nescq->mcrqf to
not call nes_free_resource() when it is already done in nes_create_cq().
Signed-off-by: Miroslaw Walukiewicz <miroslaw.walukiewicz@intel.com>
Signed-off-by: Chien Tung <chien.tin.tung@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Commands INIT_HCA, CLOSE_HCA, SYS_EN, SYS_DIS, and CLOSE_IB all have 1
second timeouts. For INIT_HCA this causes problems when more than
2^18 QPs are configured, since the command takes more than 1 second to
complete.
All other commands have 60-second timeouts. This patch makes the
above commands consistent with the rest of the commands (and with the
chip documentation).
This patch is an expansion of a patch from Arthur Kepner
<akepner@sgi.com> fixing just the INIT_HCA timeout.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
QP attributes must stay initialized when moving back to IDLE. Zeroing
them will crash the system in _flush_qp() if the QP is subsequently
moved to ERROR and back to IDLE.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The code incorrectly failed memory registration if the buffer was not
page aligned. Also, the length field is mangled causing the hardware
to think the registration is much larger than it really is.
The fix is to remove the page alignment restriction as well as the
incorrect length adjustment. Also make sure that all buffers after
the first start at a page boundary, and all buffers except the last
end on a page boundary.
Signed-off-by: Don Wood <donald.e.wood@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
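Illustration for the memory registration fix above: the first buffer may start anywhere and the last may end anywhere, but every interior boundary must fall on a page boundary. This is a minimal sketch of that check. struct phys_buf and the function names are hypothetical, not the driver's types.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct phys_buf { uint64_t addr; uint64_t size; };  /* hypothetical layout */

/* All buffers after the first must start on a page boundary, and all
 * buffers except the last must end on a page boundary. */
static bool layout_ok(const struct phys_buf *bufs, size_t n, uint64_t page_size)
{
    for (size_t i = 0; i < n; i++) {
        if (i > 0 && bufs[i].addr % page_size != 0)
            return false;
        if (i + 1 < n && (bufs[i].addr + bufs[i].size) % page_size != 0)
            return false;
    }
    return true;
}

/* Unaligned start and unaligned end, but aligned interior boundaries. */
static bool demo_accepts_unaligned_ends(void)
{
    struct phys_buf bufs[] = {
        { 0x1100, 0x0f00 },  /* starts mid-page, ends at 0x2000 */
        { 0x4000, 0x1000 },  /* page aligned at both ends */
        { 0x8000, 0x0123 },  /* starts aligned, ends mid-page */
    };
    return layout_ok(bufs, 3, 4096);
}

/* First buffer ends mid-page, leaving a hole before the next one. */
static bool demo_rejects_interior_gap(void)
{
    struct phys_buf bufs[] = {
        { 0x1100, 0x0e00 },  /* ends at 0x1f00, not a page boundary */
        { 0x4000, 0x1000 },
    };
    return !layout_ok(bufs, 2, 4096);
}
```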
Initialize pbl_count_256 to 0 to get rid of the warning:
drivers/infiniband/hw/nes/nes_verbs.c: In function 'nes_reg_mr':
drivers/infiniband/hw/nes/nes_verbs.c:1955: warning: 'pbl_count_256' may be used uninitialized in this function
Reported-by: Roland Dreier <rdreier@cisco.com>
Signed-off-by: Chien Tung <chien.tin.tung@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If NAPI is enabled while IPoIB's CQ is being drained, it creates a
race on priv->ibwc between ipoib_poll() and ipoib_drain_cq(), leading
to memory corruption.
The solution is to enable/disable NAPI in ipoib_ib_dev_{open/stop}()
instead of in ipoib_{open/stop}(), and sync NAPI on the INITIALIZED
flag instead of on the ADMIN_UP flag. This way NAPI will be disabled when
ipoib_drain_cq() is called.
This fixes <https://bugs.openfabrics.org/show_bug.cgi?id=1587>.
Signed-off-by: Yossi Etigin <yosefe@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
NFS/RDMA currently fails to set up connections if peer2peer is on.
This is due to the fact that the NFS/RDMA client sets its ORD to 0.
If peer2peer is set, make sure the active side ORD is >= 1 and the
passive side IRD is >=1.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
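Illustration for the peer2peer fix above: when peer2peer is set, a requested depth of 0 is raised to 1, and other values are left alone. The same rule is applied to the active side ORD and the passive side IRD. The function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Enforce a minimum depth of 1 only when peer2peer mode is enabled. */
static uint32_t clamp_depth(uint32_t requested, bool peer2peer)
{
    if (peer2peer && requested < 1)
        return 1;
    return requested;
}
```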
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
RDMA/nes: Add support for new SFP+ PHY
RDMA/nes: Add wide_ppm_offset parm for switch compatibility
RDMA/nes: Fix SFP+ PHY initialization
RDMA/nes: Fix nes_nic_cm_xmit() error handling
RDMA/nes: Fix error handling issues
RDMA/nes: Fix incorrect casts on 32-bit architectures
IPoIB: Document newish features
RDMA/cma: Create cm id even when IB port is down
RDMA/cma: Use rate from IPoIB broadcast when joining IPoIB multicast groups
IPoIB: Avoid free_netdev() BUG when destroying a child interface
mlx4_core: Don't leak mailbox for SET_PORT on Ethernet ports
RDMA/cxgb3: Release dependent resources only when endpoint memory is freed.
RDMA/cxgb3: Handle EEH events
IB/mlx4: Use pgprot_writecombine() for BlueFlame pages
Add new register settings for new SFP+ PHY/firmware.
Add new PHY to nes_netdev_get/set_settings.
Signed-off-by: Chien Tung <chien.tin.tung@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We have observed unstable link with a new BNT switch.
Add wide_ppm_offset parameter to allow the user to control the clock
ppm offset on the CX4 interface for better compatibility. Default is
100ppm, setting it to 1 will increase it to 300ppm. Change default
SerDes1 reference clock to external source.
Signed-off-by: Chien Tung <chien.tin.tung@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
SFP+ PHY initialization has very long delays, incorrect settings for
direct attach copper cables, and inconsistent link detection.
Adjust delays to the minimum required by the PHY. Worst case is now
less than 4 seconds. Add new register settings for direct attach
cables. Change link detection logic to use two new registers for more
consistent link state detection. Reorganize code to shorten line
length.
Signed-off-by: Chien Tung <chien.tin.tung@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We are seeing crashes or hangs when running network cable pull tests
during RDMA traffic.
In schedule_nes_timer(), we return an error if nes_nic_cm_xmit()
returns failure. This is changed to success, as the skb is put on the
timer routines to be processed later. In the send_syn() case, we were
indicating connect failure twice: once from nes_connect() and again
when the rexmit retries expire.
The other issue is skb->users. We increment it before calling
nes_nic_cm_xmit(), which calls dev_queue_xmit(), but on failure we
decrement skb->users while also putting the skb on the rexmit path.
Even when dev_queue_xmit() fails, skb->users has already been
decremented, so remove the decrement of skb->users on failure from
both schedule_nes_timer() and nes_cm_timer_tick().
There is also an extra check in nes_cm_timer_tick() for rexmit failure
that breaks out of the loop; remove it. The break causes problems
because the other nodes have had their cm_node->ref_count incremented
and are then not processed.
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix issues found by static code analysis:
(1) Check if cm_node was successfully created for loopback connection.
(2) schedule_nes_timer() does not free up allocated memory after
encountering an error. There is a WARN_ON() for this condition.
(3) there is a cm_node->freed flag which is set but not used.
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
There were some incorrect casts to unsigned long that caused 64-bit values
to be truncated on 32-bit architectures and made the driver pass invalid
addresses and lengths to the hardware. The problems were primarily seen
with kernels with highmem configured but some could show up in
non-highmem kernels, too.
Signed-off-by: Don Wood <donald.e.wood@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
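Illustration for the cast fix above: on a 32-bit architecture unsigned long is 32 bits wide, so casting a 64-bit bus address through it drops the upper half. The sketch models that width with uint32_t so the effect is visible on any host. The type and function names are hypothetical.

```c
#include <stdint.h>

/* Models unsigned long as found on a 32-bit architecture. */
typedef uint32_t ulong32;

/* Buggy pattern: the intermediate cast discards bits 32..63. */
static uint64_t via_narrow_cast(uint64_t bus_addr)
{
    return (uint64_t)(ulong32)bus_addr;
}

/* Fixed pattern: keep the value in a 64-bit type throughout. */
static uint64_t via_wide_cast(uint64_t bus_addr)
{
    return (uint64_t)bus_addr;
}
```

An address above 4 GiB, as can occur with highmem, comes back as a different and invalid address from the narrow cast.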
When doing rdma_resolve_addr(), if the relevant IB port is down, the
function fails and the cm_id is not bound to the correct device.
Therefore, application does not have a device handle and cannot wait
for the port to become active. The function fails because the
underlying IPoIB interface is not joined to the broadcast group and
therefore the SA does not have a multicast record to take a Q_Key
from.
The fix is to use lazy Q_Key resolution - cma_set_qkey() will set
id_priv->qkey if it was not set, and will be called just before the
Q_Key is really required.
Signed-off-by: Yossi Etigin <yosefe@voltaire.com>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Replace all DMA_32BIT_MASK macro with DMA_BIT_MASK(32)
Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace all DMA_64BIT_MASK macro with DMA_BIT_MASK(64)
Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
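Illustration for the two DMA mask replacements above: a single parameterized macro can produce the same values as the fixed-width constants. The macro below is written to mirror the shape of the kernel helper as best understood here. The OLD_* names are hypothetical labels for the values the removed macros are assumed to have expanded to.

```c
#include <stdint.h>

/* n == 64 is special-cased because shifting a 64-bit value by 64 bits
 * is undefined behavior in C. */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Assumed values of the constants being replaced. */
#define OLD_DMA_32BIT_MASK 0x00000000ffffffffULL
#define OLD_DMA_64BIT_MASK 0xffffffffffffffffULL

/* Returns 1 if the parameterized macro reproduces both old constants. */
static int masks_match(void)
{
    return DMA_BIT_MASK(32) == OLD_DMA_32BIT_MASK &&
           DMA_BIT_MASK(64) == OLD_DMA_64BIT_MASK;
}
```

The parameterized form also covers widths that had no dedicated constant in this sketch, such as DMA_BIT_MASK(40).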
When joining an IPoIB multicast group, use the same rate as in the
broadcast group. Otherwise, if the RDMA CM creates this group before
IPoIB does, it might get a different rate. This will cause IPoIB to
fail joining to the same group later on, because IPoIB uses strict
rate selection.
Signed-off-by: Yossi Etigin <yosefe@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We have to release the RTNL before calling free_netdev() so that the
device state has a chance to become NETREG_UNREGISTERED. Otherwise
when removing a child interface, we hit the BUG() that tests the
device state in free_netdev().
Reported-by: Yossi Etigin <yosefe@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The cxgb3 l2t entry, hwtid, and dst entry were being released before
all the iwch_ep references were released. This can cause a crash in
t3_l2t_send_slow() and other places where the l2t entry is used.
The fix is to defer releasing these resources until all endpoint
references are gone.
Details:
- move flags field to the iwch_ep_common struct.
- add a flag indicating resources are to be released.
- release resources at endpoint free time instead of close/abort time.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
- wrap calls into cxgb3 and fail them if we're in the middle
of a PCI EEH event.
- correctly unwind and release endpoint and other resources when
we are in an EEH event.
- dispatch IB_EVENT_DEVICE_FATAL event when cxgb3 notifies iw_cxgb3 of
a fatal error.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The PAT work on x86 has finally made pgprot_writecombine() a usable API
for modular drivers. As the comment indicates, this is exactly what we
want to use in mlx4_ib to map BlueFlame pages up to userspace, since
using WC for these pages improves small message latency significantly.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When net-next and infiniband were merged upstream, each branch deleted
one of a pair of adjacent lines from nes_nic.c, but when Linus fixed the
conflict up, he brought back both of the lines. Fix up to the intended
final tree state.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1750 commits)
ixgbe: Allow Priority Flow Control settings to survive a device reset
net: core: remove unneeded include in net/core/utils.c.
e1000e: update version number
e1000e: fix close interrupt race
e1000e: fix loss of multicast packets
e1000e: commonize tx cleanup routine to match e1000 & igb
netfilter: fix nf_logger name in ebt_ulog.
netfilter: fix warning in ebt_ulog init function.
netfilter: fix warning about invalid const usage
e1000: fix close race with interrupt
e1000: cleanup clean_tx_irq routine so that it completely cleans ring
e1000: fix tx hang detect logic and address dma mapping issues
bridge: bad error handling when adding invalid ether address
bonding: select current active slave when enslaving device for mode tlb and alb
gianfar: reallocate skb when headroom is not enough for fcb
Bump release date to 25Mar2009 and version to 0.22
r6040: Fix second PHY address
qeth: fix wait_event_timeout handling
qeth: check for completion of a running recovery
qeth: unregister MAC addresses during recovery.
...
Manually fixed up conflicts in:
drivers/infiniband/hw/cxgb3/cxio_hal.h
drivers/infiniband/hw/nes/nes_nic.c
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (30 commits)
RDMA/cxgb3: Enforce required firmware
IB/mlx4: Unregister IB device prior to CLOSE PORT command
mlx4_core: Add link type autosensing
mlx4_core: Don't perform SET_PORT command for Ethernet ports
RDMA/nes: Handle MPA Reject message properly
RDMA/nes: Improve use of PBLs
RDMA/nes: Remove LLTX
RDMA/nes: Inform hardware that asynchronous event has been handled
RDMA/nes: Fix tmp_addr compilation warning
RDMA/nes: Report correct vendor_id and vendor_part_id
RDMA/nes: Update copyright to new legal entity and year
RDMA/nes: Account for freed PBL after HW operation
IB: Remove useless ibdev_is_alive() tests from sysfs code
IB/sa_query: Fix AH leak due to update_sm_ah() race
IB/mad: Fix ib_post_send_mad() returning 0 with no generate send comp
IB/mad: initialize mad_agent_priv before putting on lists
IB/mad: Fix null pointer dereference in local_completions()
IB/mad: Fix RMPP header RRespTime manipulation
IB/iser: Remove hard setting of path MTU
mlx4_core: Add device IDs for MT25458 10GigE devices
...
The cxgb3 NIC driver can handle more firmware versions than iw_cxgb3,
and since commit 8207befa ("cxgb3: untie strict FW matching") cxgb3
will load with firmware versions that iw_cxgb3 can't handle. The FW
major number indicates a specific interface between the FW and
iw_cxgb3. Thus if the major number of the running firmware does not
match the required version compiled into iw_cxgb3, then iw_cxgb3 must
not register that device.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Also, removed unnecessary memset() since alloc_netdev returns
zeroed memory.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert this driver to new net_device_ops infrastructure.
Also use default net_device get-stats infrastructure
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to the ConnectX programmer's reference manual, all
operations should be stopped, all QPs should be torn down and all WQEs
flushed before the CLOSE_PORT command is invoked. In some cases
reversing the order of operations (as implemented now) could cause
a loss of completions.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We do not need to have llds set the host no for the session's
parent, because we know the session's parent is going to be
the host. This removes it from the session creation callback
and converts the drivers.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
The qdepth setting was useful when we needed libiscsi to verify
the setting. Now we only need to make sure that some default is set
if older tools pass in zero.
So this patch uses sht->cmd_per_lun, or, if the LLD uses a host per
session, lets the default be set on a per-host basis.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
We were using the shost work queue which ended up being
a little awkward since all iscsi hosts need a thread for
scanning, but only drivers hooked into libiscsi need
a workqueue for transmitting. So this patch moves the
xmit workqueue to the lib.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
There is no need to cap the queue depth in the modules. We set
this in userspace and can do that there. For performance testing
with ram based targets, this is helpful since we can have very
high queue depths.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
iser has its own logging infrastructure. Convert it to use
it instead of libiscsi.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
STag zero is a special STag that allows consumers to access any bus
address without registering memory. The nes driver unfortunately
allows STag zero to be used even with QPs created by unprivileged
userspace consumers, which means that any process with direct verbs
access to the nes device can read and write any memory accessible to
the underlying PCI device (usually any memory in the system). Such
access is usually given for cluster software such as MPI to use, so
this is a local privilege escalation bug on most systems running this
driver.
The driver was using STag zero to receive the last streaming mode
data; to allow STag zero to be disabled for unprivileged QPs, the
driver now registers a special MR for this data.
Cc: <stable@kernel.org>
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During testing, there are failures because the MPA Reject call is not
handled. To handle MPA Reject, the following changes are made:
* Handle inbound/outbound MPA Reject response message.
When nes_reject() is called for pending MPA request reply,
send the MPA Reject message to its peer (active
side) cm_node. The peer cm_node (active side) will indicate
Reject message event for the pending Connect Request.
* Handle MPA Reject response message for loopback connections and listener.
When MPA Request is rejected, check if it is a loopback
connection and if it is then it will send Reject message event
to its peer loopback node. Also when destroying listener,
check if the cm_nodes for that listener are loopback or not.
* Add graceful connection close with the MPA Reject response message.
Send graceful close (FIN, FIN ACK..) to terminate the cm_nodes.
* Some code re-org while making the above changes.
Removed recv_list and recv_list_lock from the cm_node
structure as there can be only one receive close entry on the
timer. Also implemented handle_recv_entry() as receive close
entry is processed from both nes_rem_ref_cm_node() as well as
nes_cm_timer_tick().
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Two-level 256-byte PBLs were not implemented, so the driver could report
out of memory when in fact there were PBLs still available.
This solution prefers to use 4KB PBLs over two level 256B PBLs until
the number of 4KB PBLs falls below a threshold. At this point the 4KB
PBL structure is converted to use 256B PBLs which prevents the driver
from running out of 4KB PBLs too quickly.
Signed-off-by: Don Wood <donald.e.wood@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
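Illustration for the PBL change above: a selection policy that prefers 4KB PBLs while they are plentiful and switches to two-level 256B PBLs once the 4KB pool runs low. This is a simplified sketch. The threshold value, the enum, the final fallback to a remaining 4KB PBL, and all names are assumptions, and the actual conversion of a 4KB PBL into 256B PBLs is not modeled.

```c
/* Hypothetical result of the selection policy. */
enum pbl_kind { PBL_4K, PBL_256B_TWO_LEVEL, PBL_NONE };

#define LOW_4K_THRESHOLD 8u  /* hypothetical low-water mark for 4KB PBLs */

/* Prefer 4KB PBLs until the free count falls below the threshold, then
 * use two-level 256B PBLs to preserve the remaining 4KB PBLs. */
static enum pbl_kind pick_pbl(unsigned int free_4k, unsigned int free_256b)
{
    if (free_4k >= LOW_4K_THRESHOLD)
        return PBL_4K;
    if (free_256b > 0)
        return PBL_256B_TWO_LEVEL;
    if (free_4k > 0)
        return PBL_4K;  /* assumed fallback: use what is left */
    return PBL_NONE;
}
```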