pci_{,un}map_sg are deprecated and replaced by dma_{,un}map_sg. This is
especially relevant since the rest of the driver uses the DMA API. Fix
the driver to use the replacement APIs.
Signed-off-by: Stephen Warren <swarren@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch solves a crash at the time of mlx4 driver unload or system
shutdown. The crash occurs because dma_alloc_coherent() returns one
value in mlx4_alloc_icm_coherent(), but a different value is passed to
dma_free_coherent() in mlx4_free_icm_coherent(). In turn this is because
when allocated, that pointer is passed to sg_set_buf() to record it,
then when freed it is re-calculated by calling
lowmem_page_address(sg_page()) which returns a different value. Solve
this by recording the value that dma_alloc_coherent() returns, and
passing this to dma_free_coherent().
This patch is roughly equivalent to commit 378efe798e ("RDMA/hns: Get
rid of page operation after dma_alloc_coherent").
Based-on-code-from: Christoph Hellwig <hch@lst.de>
Signed-off-by: Stephen Warren <swarren@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
DIV_ROUND_UP has implemented the code-opened function. Therefore, just
replace the implementation with DIV_ROUND_UP.
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 1383cb8103 ("mlx4_core: allocate ICM memory in page size chunks")
brought two regressions caught in our regression suite.
The big one is an additional cost of 256 bytes of overhead per 4096 bytes,
or 6.25 % which is unacceptable since ICM can be pretty large.
This comes from having to allocate one struct mlx4_icm_chunk (256 bytes)
per MLX4_TABLE_CHUNK, which the buggy commit shrank to 4KB
(instead of prior 256KB)
Note that mlx4_alloc_icm() is already able to try high order allocations
and fallback to low-order allocations under high memory pressure.
Most of these allocations happen right after boot time, when we get
plenty of non fragmented memory, there is really no point being so
pessimistic and break huge pages into order-0 ones just for fun.
We only have to tweak gfp_mask a bit, to help falling back faster,
without risking OOM killings.
Second regression is an KASAN fault, that will need further investigations.
Fixes: 1383cb8103 ("mlx4_core: allocate ICM memory in page size chunks")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Cc: John Sperbeck <jsperbeck@google.com>
Cc: Tarick Bedeir <tarick@google.com>
Cc: Qing Huang <qing.huang@oracle.com>
Cc: Daniel Jurgens <danielj@mellanox.com>
Cc: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a system is under memory presure (high usage with fragments),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.
When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.
Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...
With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.
However in order to support smaller ICM chunk size, we need to fix
another issue in large size kcalloc allocations.
E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.
The solution is to use kvzalloc to replace kcalloc which will fall back
to vmalloc automatically if kmalloc fails.
Signed-off-by: Qing Huang <qing.huang@oracle.com>
Acked-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The kernel coding style is to treat sizeof as a function
(ie. with parenthesis) not as an operator.
Also use kcalloc and kmalloc_array
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The caller to the driver marks GFP_NOIO allocations with help
of memalloc_noio-* calls now. This makes redundant to pass down
to the driver gfp flags, which can be GFP_KERNEL only.
The patch removes the gfp flags argument and updates all driver paths.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch removes BUG_ON() macro from mlx4_alloc_icm_coherent()
by checking DMA address alignment in advance and performing proper
folding in case of error.
Fixes: 5b0bf5e25e ("mlx4_core: Support ICM tables in coherent memory")
Reported-by: Ozgur Karatas <okaratas@member.fsf.org>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maintain a persistent memory that should survive reset flow/PCI error.
This comes as a preparation for coming series to support above flows.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Otherwise sparse gives a bunch of warnings like
drivers/net/ethernet/mellanox/mlx4/srq.c:110:66: sparse: incorrect type in argument 4 (different base types)
drivers/net/ethernet/mellanox/mlx4/srq.c:110:66: expected int [signed] gfp
drivers/net/ethernet/mellanox/mlx4/srq.c:110:66: got restricted gfp_t
Signed-off-by: Roland Dreier <roland@purestorage.com>
Modify the various routines used to allocate memory resources which
serve QPs in mlx4 to get an input GFP directive. Have the Ethernet
driver to use GFP_KERNEL in it's QP allocations as done prior to this
commit, and the IB driver to use GFP_NOIO when the IB verbs
IB_QP_CREATE_USE_GFP_NOIO QP creation flag is provided.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This is done to optimize FW/HW access to host memory.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.com>
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Reviewed-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds on the fixes done in commits 89dd86db78 ("mlx4_core:
Allow large mlx4_buddy bitmaps") and 3de819e6b6 ("mlx4_core: Fix
integer overflow issues around MTT table") so that memory registration
of up to 8TB (log_num_mtt=31) finally works.
It fixes integer overflows in a few mlx4_table_yyy routines in icm.c
by using a u64 intermediate variable, and int/uint issues that caused
table indexes to become nagive by setting some variables to be u32
instead of int. These problems cause crashes when a user attempted to
register > 512GB of RAM.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Fix some issues around int variables used in data structures related
to memory registration.
Handle int overflow in mlx4_init_icm_table by using a u64 intermediate
variable and changing struct mlx4_icm_table num_obj field to be u32.
Change some more fields/variables to use u32 instead of int to prevent
a case where the variable becomes negative when bit 31 is set.
Also subtract log_mtts_per_seg from the exponent when computing
num_mtt, since its added later on in that very same code area.
This and the previous commit fixes some issues which actually prevent
commit db5a7a65c0 ("mlx4_core: Scale size of MTT table with system
RAM") from working. Now, when the number of MTTs is scaled with the
size of the RAM we can map up to 8TB.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
In mlx4_init_icm_table(), free the allocated table if we failed to
allocate memory to its entries.
Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Reviewed-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
For SRIOV, some Hypervisor commands can be executed directly (native = 1).
Others should go through the command wrapper flow (for tracking resource
usage, for example, or for changing some HCA configurations that slaves
need to be notified of).
This patch sets the groundwork for this capability -- adding the correct
value of "native" in each case.
Note that if SRIOV is not activated, this parameter has no effect.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
Moves the Mellanox driver into drivers/net/ethernet/mellanox/ and
make the necessary Kconfig and Makefile changes.
CC: Roland Dreier <roland@kernel.org>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>