T3 firmware only supports one WRs worth of page list for fast register
work requests. The driver currently allows 2 WRs worth, which
doesn't work for T3, so reduce the limit in the driver.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When the SQ is flushed, mark the flushed entries as not signaled so
the poll logic doesn't re-insert the CQ entry thinking its an out of
order completion.
The bug can cause the NFS/RDMA server to crash due to processing the
same completed work request twice.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
QP attributes must stay initialized when moving back to IDLE. Zeroing
them will crash the system in _flush_qp() if the QP is subsequently
moved to ERROR and back to IDLE.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
NFS/RDMA currently fails to set up connections if peer2peer is on.
This is due to the fact that the NFS/RDMA client sets its ORD to 0.
If peer2peer is set, make sure the active side ORD is >= 1 and the
passive side IRD is >=1.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The cxgb3 l2t entry, hwtid, and dst entry were being released before
all the iwch_ep references were released. This can cause a crash in
t3_l2t_send_slow() and other places where the l2t entry is used.
The fix is to defer releasing these resources until all endpoint
references are gone.
Details:
- move flags field to the iwch_ep_common struct.
- add a flag indicating resources are to be released.
- release resources at endpoint free time instead of close/abort time.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
- wrap calls into cxgb3 and fail them if we're in the middle
of a PCI EEH event.
- correctly unwind and release endpoint and other resources when
we are in an EEH event.
- dispatch IB_EVENT_DEVICE_FATAL event when cxgb3 notifies iw_cxgb3 of
a fatal error.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1750 commits)
ixgbe: Allow Priority Flow Control settings to survive a device reset
net: core: remove unneeded include in net/core/utils.c.
e1000e: update version number
e1000e: fix close interrupt race
e1000e: fix loss of multicast packets
e1000e: commonize tx cleanup routine to match e1000 & igb
netfilter: fix nf_logger name in ebt_ulog.
netfilter: fix warning in ebt_ulog init function.
netfilter: fix warning about invalid const usage
e1000: fix close race with interrupt
e1000: cleanup clean_tx_irq routine so that it completely cleans ring
e1000: fix tx hang detect logic and address dma mapping issues
bridge: bad error handling when adding invalid ether address
bonding: select current active slave when enslaving device for mode tlb and alb
gianfar: reallocate skb when headroom is not enough for fcb
Bump release date to 25Mar2009 and version to 0.22
r6040: Fix second PHY address
qeth: fix wait_event_timeout handling
qeth: check for completion of a running recovery
qeth: unregister MAC addresses during recovery.
...
Manually fixed up conflicts in:
drivers/infiniband/hw/cxgb3/cxio_hal.h
drivers/infiniband/hw/nes/nes_nic.c
The cxgb3 NIC driver can handle more firmware versions than iw_cxgb3,
and since commit 8207befa ("cxgb3: untie strict FW matching") cxgb3
will load with firmware versions that iw_cxgb3 can't handle. The FW
major number indicates a specific interface between the FW and
iw_cxgb3. Thus if the major number of the running firmware does not
match the required version compiled into iw_cxgb3, then iw_cxgb3 must
not register that device.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Remove modulo usage to avoid a divide in the fast path (not all
gcc versions do strength reduction here).
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The poll and flush code needs to handle all send opcodes: SEND,
SEND_WITH_SE, SEND_WITH_INV, and SEND_WITH_SE_INV.
Ignore TERM indications if the connection already gone.
Ignore HW receive completions if the RQ is empty.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The variable 'offset' in iwch_sgl2pbl_map() needs to be a u64.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Freeze activity when notified that the underlying chip
is getting reset on a EEH event or fatal error.
Signed-off-by: Divy Le Ray <divy@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The base versions handle constant folding just fine, use them
directly. The replacements are OK in the include/ files as they are
not exported to userspace so we don't need the __ prefixed versions.
This patch does not affect code generation at all.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When the iw_cxgb3 module's cxgb3_client "add" func gets called by the
cxgb3 module, the iwarp driver ends up calling the ethtool ops
get_drvinfo function in cxgb3 to get the fw version and other info.
Currently the iwarp driver grabs the rtnl lock around this down call
to serialize. As of 2.6.27 or so, things changed such that the rtnl
lock is held around the call to the netdev driver open function. Also
the cxgb3_client "add" function doesn't get called if the device is
down.
So, if you load cxgb3, then load iw_cxgb3, then ifconfig up the
device, the iw_cxgb3 add func gets called with the rtnl_lock held. If
you load cxgb3, ifconfig up the device, then load iw_cxgb3, the add
func gets called without the rtnl_lock held. The former causes the
deadlock, the latter does not.
In addition, there are iw_cxgb3 sysfs handlers that also can call down
into cxgb3 to gather the fw and hw versions. These can be called
concurrently on different processors and at any time. Thus we need to
push this serialization down in the cxgb3 driver get_drvinfo func.
The fix is to remove rtnl lock usage, and use a per-device lock in cxgb3.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Acked-by: Divy Le Ray <divy@chelsio.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The array wqe->read.reserved has only two entries, but
iwch_post_zb_read() sets [0], [1], and [2], which is one too many.
This is harmless since it runs into the next field, rem_stag, which is
initialized correctly immediately after, but we might as well get
things right, especially since it makes the code smaller.
This was spotted by the Coverity checker (CID 2475).
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
The error path in iwch_connect() can fail to drop the cmid reference,
which will cause the process to hang when destroying the cmid.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When running ibv_devinfo, the active_mtu returned is garbage. This is
due to the field not being populated in the query_port function in the
driver. The patch below populates the active_mtu field with a MTU of
2k. It also zeros the struct, so that any new additions to it will
return 0.
Signed-off-by: Jon Mason <jon@opengridcomputing.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Running 'ifconfig up' on the cxgb3 interface with iw_cxgb3 loaded
causes a deadlock. The rtnl lock is already held in this path. The
function fw_supports_fastreg() was introduced in 2.6.27 to
conditionally set the IB_DEVICE_MEM_MGT_EXTENSIONS bit iff the
firmware was at 7.0 or greater, and this function also acquires the
rtnl lock and which thus causes a deadlock. Further, if iw_cxgb3 is
loaded _after_ the nic interface is brought up, then the deadlock does
not occur and therefore fw_supports_fastreg() does need to grab the
rtnl lock in that path.
It turns out this code is all useless anyway. The low level driver
will NOT allow the open if the firmware isn't 7.0, so iw_cxgb3 can
always set the MEM_MGT_EXTENSIONS bit. Simplify...
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
- MWs don't have local read/write permissions.
- Set the MW_BIND enabled bit if a MR has MW_BIND access.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
- Set the stag0 and fastreg capability bits only for kernel qps.
- QP_PRIV flag is no longer used, so don't set it.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Handling the zero STag in receive work request requires some extra
logic in the driver:
- Only set the QP_PRIV bit for kernel mode QPs.
- Add a zero STag build function for recv wrs. The uP needs a PBL
allocated and passed down in the recv WR so it can construct a HW
PBL for the zero STag S/G entries. Note: we need to place a few
restrictions on zero STag usage because of this:
1) all SGEs in a recv WR must either be zero STag or not. No mixing.
2) an individual SGE length cannot exceed 128MB for a zero-stag SGE.
This should be OK since it's not really practical to allocate
such a large chunk of pinned contiguous DMA mapped memory.
- Add an optimized non-zero-STag recv wr format for kernel users.
This is needed to optimize both zero and non-zero STag cracking in
the recv path for kernel users.
- Remove the iwch_ prefix from the static build functions.
- Bump required FW version.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
- Change the IB_DEVICE_ZERO_STAG flag to the transport-neutral name
IB_DEVICE_LOCAL_DMA_LKEY, which is used by iWARP RNICs to indicate 0
STag support and IB HCAs to indicate reserved L_Key support.
- Add a u32 local_dma_lkey member to struct ib_device. Drivers fill
this in with the appropriate local DMA L_Key (if they support it).
- Fix up the drivers using this flag.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
cxgb3 does not currently report the page size capabilities, and
incorrectly reports them internally.
This version changes the bit-shifting to a static value (per Steve's
request).
Signed-off-by: Jon Mason <jon@opengridcomputing.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
- Add a new rdma ctl command called RDMA_GET_MIB to the cxgb3 low
level driver to obtain the protocol mib from the rnic hardware.
- Add new iw_cxgb3 provider method to get the MIB from the low level
driver.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The members struct iwch_rnic_attributes.vendor_id and .vendor_part_id
are write-only, so we might as well get rid of them.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
- set fw_ver
- set hw_ver
- set max_qp_wr to something reasonable
- set max_cqe to something reasonable
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
- set IB_DEVICE_MEM_MGT_EXTENSIONS capability bit if fw supports it.
- set max_fast_reg_page_list_len device attribute.
- add iwch_alloc_fast_reg_mr function.
- add iwch_alloc_fastreg_pbl
- add iwch_free_fastreg_pbl
- adjust the WQ depth for kernel mode work queues to account for
fastreg possibly taking 2 WR slots.
- add fastreg_mr work request support.
- add local_inv work request support.
- add send_with_inv and send_with_se_inv work request support.
- removed useless duplicate enums/defines for TPT/MW/MR stuff.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The change to iwch_provider.c in commit f4e91eb4 ("IB: convert struct
class_device to struct device") undid the fix done in commit 7f049f2f
("RDMA/cxgb3: Hold rtnl_lock() around ethtool get_drvinfo call"). It
removed the calls to rtnl_lock() that serialized the iw_cxgb3 ethtool
ops calls into the cxgb3 driver. This locking is needed to avoid
messing up the internal state of the cxgb3 driver.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
drivers/infiniband/hw/cxgb3/iwch_qp.c: In function 'iwch_post_send':
drivers/infiniband/hw/cxgb3/iwch_qp.c:232: warning: 't3_wr_flit_cnt' may be used uninitialized in this function
This is what akpm describes as "the dopey
gcc-doesn't-know-that-foo(&var)-writes-to-var problem."
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
cxio_flush_sq() was failing to wrap around the software send queue
causing garbage completion entries on a flush operation.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Currently, iw_cxgb3 is severely limited on the amount of userspace
memory that can be registered in in a single memory region, which
causes big problems for applications that expect to be able to
register 100s of MB.
The problem is that the driver uses a single kmalloc()ed buffer to
hold the physical buffer list (PBL) for the entire memory region
during registration, which means that 8 bytes of contiguous memory are
required for each page of memory being registered. For example, a 64
MB registration will require 128 KB of contiguous memory with 4 KB
pages, and it unlikely that such an allocation will succeed on a busy
system.
This is purely a driver problem: the temporary page list buffer is not
needed by the hardware, so we can fix this by writing the PBL to the
hardware in page-sized chunks rather than all at once. We do this by
splitting the memory registration operation up into several steps:
- Allocate PBL space in adapter memory for the full registration
- Copy PBL to adapter memory in chunks
- Allocate STag and enable memory region
This also allows several other cleanups to the __cxio_tpt_op()
interface and related parts of the driver.
This change leaves the reregister memory region and memory window
operations broken, but they already didn't work due to other
longstanding bugs, so fixing them will be left to a later patch.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Current iw_cxgb3 code adds PBL memory to the driver's gen_pool in 2 MB
chunks. This limits the largest single allocation that can be done to
the same size, which means that with 4 KB pages, each of which takes 8
bytes of PBL memory, the largest memory region that can be allocated
is 1 GB (256K PBL entries * 4 KB/entry).
Remove this limit by adding all the PBL memory in a single gen_pool
chunk, if possible. Add code that falls back to smaller chunks if
gen_pool_add() fails, which can happen if there is not sufficient
contiguous lowmem for the internal gen_pool bitmap.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Testing on large clusters shows its way too short at 10 secs.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Remove bad BUG_ON() that can trigger in correct operation from
close_con_rpl(). It is possible to get a close_rpl message on a dead
connection. The sequence is:
- host refs ep for close exchange
- host posts close_req
- hw posts PEER_ABORT from incoming RST
- host marks ep DEAD
- host posts ABORT_RPL and releases ep resources
- hw posts CLOSE_RPL
- host derefs ep and ep freed.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
- Flush the QP only after the HW disables the connection. Currently
we flush the QP when transitioning to CLOSING. This exposes a race
condition where the HW can complete a RECV WR, for instance, -and-
the SW can flush that same WR.
- Only call CQ event handlers on flush IFF we actually flushed something.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Open MPI, Intel MPI and other applications don't respect the iWARP
requirement that the client (active) side of the connection send the
first RDMA message. This class of application connection setup is
called peer-to-peer. Typically once the connection is setup, _both_
sides want to send data.
This patch enables supporting peer-to-peer over the chelsio RNIC by
enforcing this iWARP requirement in the driver itself as part of RDMA
connection setup.
Connection setup is extended, when the peer2peer module option is 1,
such that the MPA initiator will send a 0B Read (the RTR) just after
connection setup. The MPA responder will suspend SQ processing until
the RTR message is received and reply-to.
In the longer term, this will be handled in a standardized way by
enhancing the MPA negotiation so peers can indicate whether they
want/need the RTR and what type of RTR (0B read, 0B write, or 0B send)
should be sent. This will be done by standardizing a few bits of the
private data in order to negotiate all this. However this patch
enables peer-to-peer applications now and allows most of the required
firmware and driver changes to be done and tested now.
Design:
- Add a module option, peer2peer, to enable this mode.
- New firmware support for peer-to-peer mode:
- a new bit in the rdma_init WR to tell it to do peer-2-peer
and what form of RTR message to send or expect.
- process _all_ preposted recvs before moving the connection
into rdma mode.
- passive side: defer completing the rdma_init WR until all
pre-posted recvs are processed. Suspend SQ processing until
the RTR is received.
- active side: expect and process the 0B read WR on offload TX
queue. Defer completing the rdma_init WR until all
pre-posted recvs are processed. Suspend SQ processing until
the 0B read WR is processed from the offload TX queue.
- If peer2peer is set, driver posts 0B read request on offload TX
queue just after posting the rdma_init WR to the offload TX queue.
- Add CQ poll logic to ignore unsolicitied read responses.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
cxgb3 only supports 4GB memory regions. The lustre RDMA code uses
this attribute and currently has to code around our bad setting.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Open MPI and other stress testing exposed a few bad bugs in handling
aborts in the middle of a normal close. Fix these by:
- serializing abort reply and peer abort processing with disconnect
processing
- warning (and ignoring) if ep timer is stopped when it wasn't running
- cleaning up disconnect path to correctly deal with aborting and
dead endpoints
- in iwch_modify_qp(), taking a ref on the ep before releasing the qp
lock if iwch_ep_disconnect() will be called. The ref is dropped
after calling disconnect.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a new parameter, dmasync, to the ib_umem_get() prototype. Use dmasync = 1
when mapping user-allocated CQs with ib_umem_get().
Signed-off-by: Arthur Kepner <akepner@sgi.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: David Miller <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This converts the main ib_device to use struct device instead of struct
class_device as class_device is going away.
Signed-off-by: Tony Jones <tonyj@suse.de>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Add a new IB_WR_SEND_WITH_INV send opcode that can be used to mark a
"send with invalidate" work request as defined in the iWARP verbs and
the InfiniBand base memory management extensions. Also put "imm_data"
and a new "invalidate_rkey" member in a new "ex" union in struct
ib_send_wr. The invalidate_rkey member can be used to pass in an
R_Key/STag to be invalidated. Add this new union to struct
ib_uverbs_send_wr. Add code to copy the invalidate_rkey field in
ib_uverbs_post_send().
Fix up low-level drivers to deal with the change to struct ib_send_wr,
and just remove the imm_data initialization from net/sunrpc/xprtrdma/,
since that code never does any send with immediate operations.
Also, move the existing IB_DEVICE_SEND_W_INV flag to a new bit, since
the iWARP drivers currently in the tree set the bit. The amso1100
driver at least will silently fail to honor the IB_SEND_INVALIDATE bit
if passed in as part of userspace send requests (since it does not
implement kernel bypass work request queueing). Remove the flag from
all existing drivers that set it until we know which ones are OK.
The values chosen for the new flag is not consecutive to avoid clashing
with flags defined in the XRC patches, which are not merged yet but
which are already in use and are likely to be merged soon.
This resurrects a patch sent long ago by Mikkel Hagen <mhagen@iol.unh.edu>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
__FUNCTION__ is gcc-specific, use __func__ instead.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix sparse warnings about pointer signedness by using a signed int when
calling idr_get_new_above().
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Because of a typo in iwch_accept_cr(), the cxgb3 connection handling
code programs the hardware IRD (incoming RDMA read queue depth) with
the value that is passed in for the ORD (outgoing RDMA read queue
depth). In particular this means that if an application passes in IRD
> 0 and ORD = 0 (which is a completely sane and valid thing to do for
an app that expects only incoming RDMA read requests), then the
hardware will end up programmed with IRD = 0 and the app will fail in
a mysterious way.
Fix this by using "ep->ird" instead of "ep->ord" in the intended place.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The cxbg3 driver is unnecessarily decreasing the number of CQ entries by
one when creating a CQ. This will cause the CQ not to have as many
entries as requested by the user if the user requests a power of 2 size.
Signed-off-by: Jon Mason <jon@opengridcomputing.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Set cap.max_inline_data to the actual max inline data that the adapter
support, so that userspace apps see the right value returned.
Signed-off-by: Jon Mason <jon@opengridcomputing.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
A single entry (addr 0x10001000, size 0x2000) will get converted to
page address 0x10000000 with a page size of 0x4000. The code as it
stands doesn't address the single buffer case, but in fact it allows
the subsequent single-buffer special case to be eliminated entirely.
Because the mask now includes the (page adjusted) starting and ending
addresses, the general case works for the single buffer case as well.
Signed-off-by: Bryan Rosenburg <rosnbrg@us.ibm.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>