Code cleanup. Removed the size-based benchmark compile-time option (USE_SIZE_BASED_BENCHMARKS), as it has not been used in ages and clutters the code. Also improved the accuracy and flexibility of the architectural preprocessor macros so that they correctly identify build target capabilities, in preparation for the ARM port.

This commit is contained in:
Mark Gottscho 2015-04-23 12:54:33 -07:00
Parent 7ffc78fe93
Commit 5d537ae69e
37 changed files with 245 additions and 614 deletions
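For context, a minimal sketch of capability-style preprocessor macros like the ones this commit refines. The macro names (ARCH_INTEL_X86_64, ARCH_ARM, HAS_WORD_64/HAS_WORD_128/HAS_WORD_256) appear in the diff below; deriving them from compiler-defined macros as shown is an assumption for illustration, not X-Mem's actual build logic.

```cpp
// Illustrative sketch only: not X-Mem's real definitions.
#if defined(__x86_64__) || defined(_M_X64)
#define ARCH_INTEL_X86_64   // 64-bit x86 build target
#define HAS_WORD_64         // native 64-bit words
#endif
#ifdef __SSE2__
#define HAS_WORD_128        // 128-bit vector words (SSE2)
#endif
#ifdef __AVX__
#define HAS_WORD_256        // 256-bit vector words (AVX)
#endif
#if defined(__arm__) || defined(__aarch64__)
#define ARCH_ARM            // ARM build target (port in progress)
#endif
```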

View file

@ -38,7 +38,7 @@ PROJECT_NAME = X-Mem
# could be handy for archiving the generated documentation or if some version
# control system is used.
PROJECT_NUMBER = 2.1.13
PROJECT_NUMBER = 2.1.14
# Using the PROJECT_BRIEF tag one can provide an optional one line description
# for a project that appears at the top of each page and should give viewer a

View file

@ -1,7 +1,7 @@
README
------------------------------------------------------------------------------------------------------------
X-Mem: Extensible Memory Benchmarking Tool v2.1.13
X-Mem: Extensible Memory Benchmarking Tool v2.1.14
------------------------------------------------------------------------------------------------------------
The flexible open-source research tool for characterizing memory hierarchy throughput, latency, and power.
@ -10,7 +10,7 @@ Originally authored by Mark Gottscho (Email: <mgottscho@ucla.edu>) as a Summer 2
This project is under active development. Stay tuned for more updates.
PROJECT REVISION DATE: April 22, 2015.
PROJECT REVISION DATE: April 23, 2015.
------------------------------------------------------------------------------------------------------------
LICENSE
@ -54,9 +54,8 @@ Extensibility: modularity via C++ object-oriented principles
- Supports rapid addition of new benchmark kernel routines
- Example: stream triad algorithm, impact of false sharing, etc. are possible with minor changes
Cross-platform: Currently implemented for Windows and GNU/Linux on x86-64, x86-64 with AVX extensions CPUs
Cross-platform: Currently implemented for Windows and GNU/Linux on x86, x86-64, and x86-64 CPUs with AVX extensions
- Designed to allow straightforward porting to other operating systems and ISAs
- 32-bit x86 port under development
- ARM port under development
Memory throughput:
@ -92,9 +91,8 @@ There are a few runtime prerequisites in order for the software to run correctly
HARDWARE:
- Intel x86-64 CPU with optional support for AVX extensions. AMD CPUs should also work although this has not been tested.
- COMING SOON: Intel 32-bit x86 CPU
- COMING SOON: ARMv7 CPUs
- Intel x86 or x86-64 CPU with optional support for AVX extensions. AMD CPUs should also work although this has not been tested.
- COMING SOON: ARM CPUs
WINDOWS:
@ -109,7 +107,7 @@ WINDOWS:
GNU/LINUX:
- GNU utilities with support for C++11. Tested with gcc 4.8.2 on Ubuntu 14.04 LTS for x86-64 CPU.
- GNU utilities with support for C++11. Tested with gcc 4.8.2 on Ubuntu 14.04 LTS for x86-64 CPU with AVX.
- libhugetlbfs. You can obtain it at <http://libhugetlbfs.sourceforge.net>. On Ubuntu systems, you can install using "sudo apt-get install libhugetlbfs0".
- Potentially, administrator privileges, if you plan to use the --large_pages option.
- During runtime, if the --large_pages option is selected, you may need to first manually ensure that large pages are available from the OS. This can be done by running "hugeadm --pool-list". It is recommended to set minimum pool to 1GB (in order to measure DRAM effectively). If needed, this can be done by running "hugeadm --pool-pages-min 2MB:512". Alternatively, run the linux_setup_runtime_hugetlbfs.sh script that is provided with X-Mem.
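As a side note, the kernel facility underneath libhugetlbfs can be exercised directly. The sketch below requests a single 2 MB huge page via mmap() with MAP_HUGETLB; it is illustration only, since X-Mem itself allocates through libhugetlbfs, and it fails cleanly if no huge pages are reserved in the pool.

```cpp
#include <sys/mman.h>
#include <cstdio>

int main() {
    const size_t len = 2 * 1024 * 1024; // one 2 MB huge page
    void* buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap"); // e.g., ENOMEM when the huge page pool is empty
        return 1;
    }
    munmap(buf, len);
    return 0;
}
```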
@ -313,7 +311,7 @@ WINDOWS:
GNU/LINUX:
- gcc with support for the C++11 standard. Tested with gcc version 4.8.2 on Ubuntu 14.04 LTS for x86-64.
- gcc with support for the C++11 standard. Tested with gcc version 4.8.2 on Ubuntu 14.04 LTS for x86-64 with AVX.
- Python 2.7. You can obtain it at <http://www.python.org>. On Ubuntu systems, you can install using "sudo apt-get install python2.7". You may need some other Python 2.7 packages as well.
- SCons build system. You can obtain it at <http://www.scons.org>. On Ubuntu systems, you can install using "sudo apt-get install scons". Build tested with SCons 2.3.4.
- Kernel support for large (huge) pages. This support can be verified on your Linux installation by running "grep hugetlbfs /proc/filesystems". If you do not have huge page support in your kernel, you can build a kernel with the appropriate options switched on: "CONFIG_HUGETLB_PAGE" and "CONFIG_HUGETLBFS".

Binary data
X-Mem_Developer_Manual.pdf

Binary file not shown.

Binary data
bin/xmem-linux-x64

Binary file not shown.

Binary data
bin/xmem-linux-x64_avx

Binary file not shown.

Binary data
bin/xmem-linux-x86

Binary file not shown.

View file

@ -45,9 +45,6 @@ Benchmark::Benchmark(
void* mem_array,
size_t len,
uint32_t iterations,
#ifdef USE_SIZE_BASED_BENCHMARKS
uint32_t passes_per_iteration,
#endif
uint32_t num_worker_threads,
uint32_t mem_node,
uint32_t cpu_node,
@ -62,9 +59,6 @@ Benchmark::Benchmark(
_mem_array(mem_array),
_len(len),
_iterations(iterations),
#ifdef USE_SIZE_BASED_BENCHMARKS
_passes_per_iteration(passes_per_iteration),
#endif
_num_worker_threads(num_worker_threads),
_mem_node(mem_node),
_cpu_node(cpu_node),
@ -272,12 +266,6 @@ uint32_t Benchmark::getIterations() const {
return _iterations;
}
#ifdef USE_SIZE_BASED_BENCHMARKS
uint32_t Benchmark::getPassesPerIteration() const {
return _passes_per_iteration;
}
#endif
chunk_size_t Benchmark::getChunkSize() const {
return _chunk_size;
}

View file

@ -527,16 +527,9 @@ bool BenchmarkManager::__buildBenchmarks() {
//Add the throughput benchmark
benchmark_name = static_cast<std::ostringstream*>(&(std::ostringstream() << "Test #" << g_test_index << "T (Throughput)"))->str();
#ifdef USE_SIZE_BASED_BENCHMARKS
//Determine number of passes for each benchmark. This is working set size-dependent, to ensure the timed duration of each run is sufficiently long, but not too long.
size_t passes_per_iteration = compute_number_of_passes((mem_array_len / __config.getNumWorkerThreads()) / KB);
#endif
__tp_benchmarks.push_back(new ThroughputBenchmark(mem_array,
mem_array_len,
__config.getIterationsPerTest(),
#ifdef USE_SIZE_BASED_BENCHMARKS
passes_per_iteration,
#endif
__config.getNumWorkerThreads(),
mem_node,
cpu_node,
@ -556,16 +549,9 @@ bool BenchmarkManager::__buildBenchmarks() {
//Special case: number of worker threads is 1, only need 1 latency thread in general to do unloaded latency tests.
if (__config.getNumWorkerThreads() > 1 || __lat_benchmarks.size() < 1) {
benchmark_name = static_cast<std::ostringstream*>(&(std::ostringstream() << "Test #" << g_test_index << "L (Latency)"))->str();
#ifdef USE_SIZE_BASED_BENCHMARKS
//Determine number of passes for each benchmark. This is working set size-dependent, to ensure the timed duration of each run is sufficiently long, but not too long.
passes_per_iteration = compute_number_of_passes((mem_array_len / __config.getNumWorkerThreads()) / KB) / 4;
#endif
__lat_benchmarks.push_back(new LatencyBenchmark(mem_array,
mem_array_len,
__config.getIterationsPerTest(),
#ifdef USE_SIZE_BASED_BENCHMARKS
passes_per_iteration,
#endif
__config.getNumWorkerThreads(),
mem_node,
cpu_node,
@ -600,16 +586,9 @@ bool BenchmarkManager::__buildBenchmarks() {
//Add the throughput benchmark
benchmark_name = static_cast<std::ostringstream*>(&(std::ostringstream() << "Test #" << g_test_index << "T (Throughput)"))->str();
#ifdef USE_SIZE_BASED_BENCHMARKS
//Determine number of passes for each benchmark. This is working set size-dependent, to ensure the timed duration of each run is sufficiently long, but not too long.
size_t passes_per_iteration = compute_number_of_passes((mem_array_len / __config.getNumWorkerThreads()) / KB);
#endif
__tp_benchmarks.push_back(new ThroughputBenchmark(mem_array,
mem_array_len,
__config.getIterationsPerTest(),
#ifdef USE_SIZE_BASED_BENCHMARKS
passes_per_iteration,
#endif
__config.getNumWorkerThreads(),
mem_node,
cpu_node,
@ -628,16 +607,9 @@ bool BenchmarkManager::__buildBenchmarks() {
//Special case: number of worker threads is 1, only need 1 latency thread in general to do unloaded latency tests.
if (__config.getNumWorkerThreads() > 1 || __lat_benchmarks.size() < 1) {
benchmark_name = static_cast<std::ostringstream*>(&(std::ostringstream() << "Test #" << g_test_index << "L (Latency)"))->str();
#ifdef USE_SIZE_BASED_BENCHMARKS
//Determine number of passes for each benchmark. This is working set size-dependent, to ensure the timed duration of each run is sufficiently long, but not too long.
passes_per_iteration = compute_number_of_passes((mem_array_len / __config.getNumWorkerThreads()) / KB) / 4;
#endif
__lat_benchmarks.push_back(new LatencyBenchmark(mem_array,
mem_array_len,
__config.getIterationsPerTest(),
#ifdef USE_SIZE_BASED_BENCHMARKS
passes_per_iteration,
#endif
__config.getNumWorkerThreads(),
mem_node,
cpu_node,
@ -697,16 +669,9 @@ bool BenchmarkManager::runExtDelayInjectedLoadedLatencyBenchmark() {
std::string benchmark_name = static_cast<std::ostringstream*>(&(std::ostringstream() << "Test #" << g_test_index++ << "E" << EXT_NUM_DELAY_INJECTED_LOADED_LATENCY_BENCHMARK << " (Extension: Delay-Injected Loaded Latency)"))->str();
#ifdef USE_SIZE_BASED_BENCHMARKS
//Determine number of passes for each benchmark. This is working set size-dependent, to ensure the timed duration of each run is sufficiently long, but not too long.
passes_per_iteration = compute_number_of_passes((mem_array_len / __config.getNumWorkerThreads()) / KB) / 4;
#endif
del_lat_benchmarks.push_back(new DelayInjectedLoadedLatencyBenchmark(mem_array,
mem_array_len,
__config.getIterationsPerTest(),
#ifdef USE_SIZE_BASED_BENCHMARKS
passes_per_iteration,
#endif
__config.getNumWorkerThreads(),
mem_node,
cpu_node,

View file

@ -52,9 +52,6 @@ LatencyBenchmark::LatencyBenchmark(
void* mem_array,
size_t len,
uint32_t iterations,
#ifdef USE_SIZE_BASED_BENCHMARKS
uint32_t passes_per_iteration,
#endif
uint32_t num_worker_threads,
uint32_t mem_node,
uint32_t cpu_node,
@ -69,9 +66,6 @@ LatencyBenchmark::LatencyBenchmark(
mem_array,
len,
iterations,
#ifdef USE_SIZE_BASED_BENCHMARKS
passes_per_iteration,
#endif
num_worker_threads,
mem_node,
cpu_node,
@ -303,9 +297,6 @@ bool LatencyBenchmark::_run_core() {
if (t == 0) { //special case: thread 0 is always latency thread
workers.push_back(new LatencyWorker(thread_mem_array,
len_per_thread,
#ifdef USE_SIZE_BASED_BENCHMARKS
_passes_per_iteration,
#endif
lat_kernel_fptr,
lat_kernel_dummy_fptr,
cpu_id));
@ -313,18 +304,12 @@ bool LatencyBenchmark::_run_core() {
if (_pattern_mode == SEQUENTIAL)
workers.push_back(new LoadWorker(thread_mem_array,
len_per_thread,
#ifdef USE_SIZE_BASED_BENCHMARKS
_passes_per_iteration,
#endif
load_kernel_fptr_seq,
load_kernel_dummy_fptr_seq,
cpu_id));
else if (_pattern_mode == RANDOM)
workers.push_back(new LoadWorker(thread_mem_array,
len_per_thread,
#ifdef USE_SIZE_BASED_BENCHMARKS
_passes_per_iteration,
#endif
load_kernel_fptr_ran,
load_kernel_dummy_fptr_ran,
cpu_id));

View file

@ -49,9 +49,6 @@ using namespace xmem;
LatencyWorker::LatencyWorker(
void* mem_array,
size_t len,
#ifdef USE_SIZE_BASED_BENCHMARKS
uint32_t passes_per_iteration,
#endif
RandomFunction kernel_fptr,
RandomFunction kernel_dummy_fptr,
int32_t cpu_affinity
@ -59,9 +56,6 @@ LatencyWorker::LatencyWorker(
MemoryWorker(
mem_array,
len,
#ifdef USE_SIZE_BASED_BENCHMARKS
passes_per_iteration,
#endif
cpu_affinity
),
__kernel_fptr(kernel_fptr),
@ -87,22 +81,14 @@ void LatencyWorker::run() {
tick_t elapsed_dummy_ticks = 0;
tick_t adjusted_ticks = 0;
bool warning = false;
#ifdef USE_TIME_BASED_BENCHMARKS
void* mem_array = NULL;
size_t len = 0;
tick_t target_ticks = g_ticks_per_ms * BENCHMARK_DURATION_MS; //Rough target run duration in ticks
#endif
//Grab relevant setup state thread-safely and keep it local
if (_acquireLock(-1)) {
#ifdef USE_TIME_BASED_BENCHMARKS
mem_array = _mem_array;
len = _len;
#endif
#ifdef USE_SIZE_BASED_BENCHMARKS
passes = _passes_per_iteration;
#endif
bytes_per_pass = LATENCY_BENCHMARK_UNROLL_LENGTH * 8;
cpu_affinity = _cpu_affinity;
kernel_fptr = __kernel_fptr;
@ -134,7 +120,6 @@ void LatencyWorker::run() {
}
//Run benchmark
#ifdef USE_TIME_BASED_BENCHMARKS
//Run actual version of function and loop overhead
next_address = static_cast<uintptr_t*>(mem_array);
while (elapsed_ticks < target_ticks) {
@ -154,25 +139,6 @@ void LatencyWorker::run() {
elapsed_dummy_ticks += (stop_tick - start_tick);
p+=256;
}
#endif
#ifdef USE_SIZE_BASED_BENCHMARKS
//Time actual version of function and loop overhead
next_address = static_cast<uintptr_t*>(mem_array);
start_tick = start_timer();
for (p = 0; p < passes; p++)
(*kernel_fptr)(next_address, &next_address, len);
stop_tick = stop_timer();
elapsed_ticks += (start_tick - stop_tick);
//Time dummy version of function and loop overhead
next_address = static_cast<uintptr_t*>(_mem_array);
start_tick = start_timer();
for (p = 0; p < passes; p++)
(*kernel_dummy_fptr)(next_address, &next_address, len);
stop_tick = stop_timer();
elapsed_dummy_ticks += (start_tick - stop_tick);
#endif
adjusted_ticks = elapsed_ticks - elapsed_dummy_ticks;
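To make the cleanup concrete, here is a toy, self-contained contrast of the two strategies this hunk shows side by side. Names and the std::chrono timing are illustrative stand-ins for X-Mem's tick counters: the time-based loop that survives runs until a wall-clock budget is spent, while the removed size-based loop runs a pass count fixed up front, so its duration varies with the hardware.

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

using Clock = std::chrono::steady_clock;

static uint64_t touch(const std::vector<uint64_t>& a) { // stand-in "kernel"
    uint64_t sum = 0;
    for (uint64_t v : a) sum += v;
    return sum;
}

int main() {
    std::vector<uint64_t> a(1 << 20, 1);
    volatile uint64_t sink = 0;

    // Time-based (kept): fixed wall-clock budget per iteration.
    auto deadline = Clock::now() + std::chrono::milliseconds(250);
    uint64_t passes_done = 0;
    while (Clock::now() < deadline) {
        sink = sink + touch(a);
        passes_done++;
    }
    printf("time-based: %llu passes in ~250 ms\n",
           (unsigned long long)passes_done);

    // Size-based (removed by this commit): pass count chosen up front.
    const uint64_t passes = 64;
    for (uint64_t p = 0; p < passes; p++)
        sink = sink + touch(a);
    return 0;
}
```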

View file

@ -49,9 +49,6 @@ using namespace xmem;
LoadWorker::LoadWorker(
void* mem_array,
size_t len,
#ifdef USE_SIZE_BASED_BENCHMARKS
uint32_t passes_per_iteration,
#endif
SequentialFunction kernel_fptr,
SequentialFunction kernel_dummy_fptr,
int32_t cpu_affinity
@ -59,9 +56,6 @@ LoadWorker::LoadWorker(
MemoryWorker(
mem_array,
len,
#ifdef USE_SIZE_BASED_BENCHMARKS
passes_per_iteration,
#endif
cpu_affinity
),
__use_sequential_kernel_fptr(true),
@ -75,9 +69,6 @@ LoadWorker::LoadWorker(
LoadWorker::LoadWorker(
void* mem_array,
size_t len,
#ifdef USE_SIZE_BASED_BENCHMARKS
uint32_t passes_per_iteration,
#endif
RandomFunction kernel_fptr,
RandomFunction kernel_dummy_fptr,
int32_t cpu_affinity
@ -85,9 +76,6 @@ LoadWorker::LoadWorker(
MemoryWorker(
mem_array,
len,
#ifdef USE_SIZE_BASED_BENCHMARKS
passes_per_iteration,
#endif
cpu_affinity
),
__use_sequential_kernel_fptr(false),
@ -121,25 +109,16 @@ void LoadWorker::run() {
tick_t elapsed_dummy_ticks = 0;
tick_t adjusted_ticks = 0;
bool warning = false;
#ifdef USE_TIME_BASED_BENCHMARKS
void* mem_array = NULL;
size_t len = 0;
tick_t target_ticks = g_ticks_per_ms * BENCHMARK_DURATION_MS; //Rough target run duration in ticks
uint32_t p = 0;
bytes_per_pass = THROUGHPUT_BENCHMARK_BYTES_PER_PASS;
#endif
//Grab relevant setup state thread-safely and keep it local
if (_acquireLock(-1)) {
#ifdef USE_TIME_BASED_BENCHMARKS
mem_array = _mem_array;
len = _len;
#endif
#ifdef USE_SIZE_BASED_BENCHMARKS
bytes_per_pass = _len;
passes = _passes_per_iteration;
#endif
cpu_affinity = _cpu_affinity;
use_sequential_kernel_fptr = __use_sequential_kernel_fptr;
kernel_fptr_seq = __kernel_fptr_seq;
@ -176,7 +155,6 @@ void LoadWorker::run() {
//Run the benchmark!
uintptr_t* next_address = static_cast<uintptr_t*>(mem_array);
#ifdef USE_TIME_BASED_BENCHMARKS
//Run actual version of function and loop overhead
while (elapsed_ticks < target_ticks) {
if (use_sequential_kernel_fptr) { //sequential function semantics
@ -221,38 +199,6 @@ void LoadWorker::run() {
elapsed_dummy_ticks += (stop_tick - start_tick);
}
#endif
#ifdef USE_SIZE_BASED_BENCHMARKS
next_address = static_cast<uintptr_t*>(mem_array);
if (use_sequential_kernel_fptr) { //sequential function semantics
start_tick = start_timer();
for (uint32_t p = 0; p < __passes_per_iteration; p++)
(*kernel_fptr_seq)(start_address, end_address);
stop_tick = stop_timer();
} else { //random function semantics
start_tick = start_timer();
for (uint32_t p = 0; p < __passes_per_iteration; p++)
(*kernel_fptr_ran)(next_address, &next_address, bytes_per_pass);
stop_tick = stop_timer();
}
elapsed_ticks = stop_tick - start_tick;
//Time dummy version of function and loop overhead
next_address = static_cast<uintptr_t*>(mem_array);
if (use_sequential_kernel_fptr) { //sequential function semantics
start_tick = start_timer();
for (uint32_t p = 0; p < __passes_per_iteration; p++)
(*kernel_dummy_fptr_seq)(start_address, end_address);
stop_tick = stop_timer();
} else { //random function semantics
start_tick = start_timer();
for (uint32_t p = 0; p < __passes_per_iteration; p++)
(*kernel_dummy_fptr_ran)(next_address, &next_address, bytes_per_pass);
stop_tick = stop_timer();
}
elapsed_dummy_ticks = stop_tick - start_tick;
#endif
//Unset processor affinity
if (locked)

View file

@ -36,9 +36,6 @@ using namespace xmem;
MemoryWorker::MemoryWorker(
void* mem_array,
size_t len,
#ifdef USE_SIZE_BASED_BENCHMARKS
uint32_t passes_per_iteration,
#endif
int32_t cpu_affinity
) :
_mem_array(mem_array),
@ -50,9 +47,6 @@ MemoryWorker::MemoryWorker(
_elapsed_dummy_ticks(0),
_adjusted_ticks(0),
_warning(false),
#ifdef USE_SIZE_BASED_BENCHMARKS
_passes_per_iteration(passes_per_iteration),
#endif
_completed(false)
{
}

View file

@ -1,7 +1,7 @@
README
------------------------------------------------------------------------------------------------------------
X-Mem: Extensible Memory Benchmarking Tool v2.1.13
X-Mem: Extensible Memory Benchmarking Tool v2.1.14
------------------------------------------------------------------------------------------------------------
The flexible open-source research tool for characterizing memory hierarchy throughput, latency, and power.
@ -10,7 +10,7 @@ Originally authored by Mark Gottscho (Email: <mgottscho@ucla.edu>) as a Summer 2
This project is under active development. Stay tuned for more updates.
PROJECT REVISION DATE: April 22, 2015.
PROJECT REVISION DATE: April 23, 2015.
------------------------------------------------------------------------------------------------------------
LICENSE
@ -54,9 +54,8 @@ Extensibility: modularity via C++ object-oriented principles
- Supports rapid addition of new benchmark kernel routines
- Example: stream triad algorithm, impact of false sharing, etc. are possible with minor changes
Cross-platform: Currently implemented for Windows and GNU/Linux on x86-64, x86-64 with AVX extensions CPUs
Cross-platform: Currently implemented for Windows and GNU/Linux on x86, x86-64, and x86-64 CPUs with AVX extensions
- Designed to allow straightforward porting to other operating systems and ISAs
- 32-bit x86 port under development
- ARM port under development
Memory throughput:
@ -92,9 +91,8 @@ There are a few runtime prerequisites in order for the software to run correctly
HARDWARE:
- Intel x86-64 CPU with optional support for AVX extensions. AMD CPUs should also work although this has not been tested.
- COMING SOON: Intel 32-bit x86 CPU
- COMING SOON: ARMv7 CPUs
- Intel x86 or x86-64 CPU with optional support for AVX extensions. AMD CPUs should also work although this has not been tested.
- COMING SOON: ARM CPUs
WINDOWS:
@ -109,7 +107,7 @@ WINDOWS:
GNU/LINUX:
- GNU utilities with support for C++11. Tested with gcc 4.8.2 on Ubuntu 14.04 LTS for x86-64 CPU.
- GNU utilities with support for C++11. Tested with gcc 4.8.2 on Ubuntu 14.04 LTS for x86-64 CPU with AVX.
- libhugetlbfs. You can obtain it at <http://libhugetlbfs.sourceforge.net>. On Ubuntu systems, you can install using "sudo apt-get install libhugetlbfs0".
- Potentially, administrator privileges, if you plan to use the --large_pages option.
- During runtime, if the --large_pages option is selected, you may need to first manually ensure that large pages are available from the OS. This can be done by running "hugeadm --pool-list". It is recommended to set minimum pool to 1GB (in order to measure DRAM effectively). If needed, this can be done by running "hugeadm --pool-pages-min 2MB:512". Alternatively, run the linux_setup_runtime_hugetlbfs.sh script that is provided with X-Mem.
@ -313,7 +311,7 @@ WINDOWS:
GNU/LINUX:
- gcc with support for the C++11 standard. Tested with gcc version 4.8.2 on Ubuntu 14.04 LTS for x86-64.
- gcc with support for the C++11 standard. Tested with gcc version 4.8.2 on Ubuntu 14.04 LTS for x86-64 with AVX.
- Python 2.7. You can obtain it at <http://www.python.org>. On Ubuntu systems, you can install using "sudo apt-get install python2.7". You may need some other Python 2.7 packages as well.
- SCons build system. You can obtain it at <http://www.scons.org>. On Ubuntu systems, you can install using "sudo apt-get install scons". Build tested with SCons 2.3.4.
- Kernel support for large (huge) pages. This support can be verified on your Linux installation by running "grep hugetlbfs /proc/filesystems". If you do not have huge page support in your kernel, you can build a kernel with the appropriate options switched on: "CONFIG_HUGETLB_PAGE" and "CONFIG_HUGETLBFS".

View file

@ -44,9 +44,6 @@ ThroughputBenchmark::ThroughputBenchmark(
void* mem_array,
size_t len,
uint32_t iterations,
#ifdef USE_SIZE_BASED_BENCHMARKS
uint32_t passes_per_iteration,
#endif
uint32_t num_worker_threads,
uint32_t mem_node,
uint32_t cpu_node,
@ -61,9 +58,6 @@ ThroughputBenchmark::ThroughputBenchmark(
mem_array,
len,
iterations,
#ifdef USE_SIZE_BASED_BENCHMARKS
passes_per_iteration,
#endif
num_worker_threads,
mem_node,
cpu_node,
@ -141,18 +135,12 @@ bool ThroughputBenchmark::_run_core() {
if (_pattern_mode == SEQUENTIAL)
workers.push_back(new LoadWorker(thread_mem_array,
len_per_thread,
#ifdef USE_SIZE_BASED_BENCHMARKS
_passes_per_iteration,
#endif
kernel_fptr_seq,
kernel_dummy_fptr_seq,
cpu_id));
else if (_pattern_mode == RANDOM)
workers.push_back(new LoadWorker(thread_mem_array,
len_per_thread,
#ifdef USE_SIZE_BASED_BENCHMARKS
_passes_per_iteration,
#endif
kernel_fptr_ran,
kernel_dummy_fptr_ran,
cpu_id));

View file

@ -44,24 +44,66 @@
#include <random>
#include <algorithm>
#include <time.h>
#if defined(__gnu_linux__) && defined(ARCH_INTEL_X86_64)
#include <immintrin.h> //for Intel AVX intrinsics
#if defined(__gnu_linux__) && defined(ARCH_INTEL_X86_64) && (defined(HAS_WORD_128) || defined(HAS_WORD_256))
//Intel intrinsics
#include <emmintrin.h>
#include <immintrin.h>
#include <smmintrin.h>
#endif
#ifdef ARCH_INTEL_X86_64
#define my_32b_set_128b_word(a, b, c, d) _mm_set_epi32x(a, b, c, d)
#define my_32b_set_256b_word(a, b, c, d, e, f, g, h) _mm256_set_epi32x(a, b, c, d, e, f, g, h)
#define my_64b_set_128b_word(a, b) _mm_set_epi64x(a, b)
#define my_64b_set_256b_word(a, b, c, d) _mm256_set_epi64x(a, b, c, d)
#if defined(__gnu_linux__) && defined(ARCH_INTEL_X86_64) && (defined(HAS_WORD_128) || defined(HAS_WORD_256))
#define my_32b_set_128b_word(a, b, c, d) _mm_set_epi32(a, b, c, d) //SSE2 intrinsic, corresponds to ??? instruction. Header: emmintrin.h
#define my_32b_set_256b_word(a, b, c, d, e, f, g, h) _mm256_set_epi32(a, b, c, d, e, f, g, h) //AVX intrinsic, corresponds to ??? instruction. Header: immintrin.h
#define my_64b_set_128b_word(a, b) _mm_set_epi64x(a, b) //SSE2 intrinsic, corresponds to ??? instruction. Header: emmintrin.h
#define my_64b_set_256b_word(a, b, c, d) _mm256_set_epi64x(a, b, c, d) //AVX intrinsic, corresponds to ??? instruction. Header: immintrin.h
#define my_32b_extractLSB_128b(w) _mm_extract_epi32(w, 0)
#define my_32b_extractLSB_256b(w) _mm256_extract_epi32(w, 0)
#define my_64b_extractLSB_128b(w) _mm_extract_epi64(w, 0)
#define my_64b_extractLSB_256b(w) _mm256_extract_epi64(w, 0)
#define my_32b_extractLSB_128b(w) _mm_extract_epi32(w, 0) //SSE 4.1 intrinsic, corresponds to "pextrd" instruction. Header: smmintrin.h
#define my_32b_extractLSB_256b(w) _mm256_extract_epi32(w, 0) //AVX intrinsic, corresponds to ??? instruction. Header: immintrin.h
#define my_64b_extractLSB_128b(w) _mm_extract_epi64(w, 0) //SSE 4.1 intrinsic, corresponds to "pextrq" instruction. Header: smmintrin.h
#define my_64b_extractLSB_256b(w) _mm256_extract_epi64(w, 0) //AVX intrinsic, corresponds to ??? instruction. Header: immintrin.h
#endif
#ifdef ARCH_ARM
#error TODO: ARM intrinsics
#error TODO: ARM intrinsics?
#endif
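A tiny usage sketch of the set/extract wrappers above, assuming Word128_t aliases __m128i as it does on x86-64 builds (compile with SSE4.1 enabled, e.g., gcc -msse4.1):

```cpp
#include <cstdint>
#include <cstdio>
#include <emmintrin.h>  // _mm_set_epi64x
#include <smmintrin.h>  // _mm_extract_epi64

typedef __m128i Word128_t; // assumed alias, matching the x86-64 build

int main() {
    // Pack two 64-bit halves (high, low), then pull the low half back out.
    Word128_t w = _mm_set_epi64x(0x1111111111111111LL, 0x2222222222222222LL);
    uint64_t lsb = _mm_extract_epi64(w, 0); // the "pextrq" path noted above
    printf("0x%llx\n", (unsigned long long)lsb); // prints 0x2222222222222222
    return 0;
}
```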
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
/* Hand-coded assembly functions for the 128-bit and 256-bit benchmark kernels on Windows x86-64 where applicable.
* These are needed because the VC++ compiler, like gcc, optimizes away the vector instructions unless I use the volatile keyword.
* But I can't use something like volatile Word256_t* because it is incompatible with _mm_load_si256() with VC++, etc.
* Then, an alternative is to use inline assembly for Windows. However, the VC++ compiler does not support inline assembly in x86-64!
* The only remaining choice for this unfortunate circumstance is to use hand-coded assembly on Windows x86-64 builds.
* In this special case, I implemented the routine as a wrapper around the assembler function.
*/
#ifdef HAS_WORD_128
//128-bit
extern "C" int win_x86_64_asm_forwSequentialRead_Word128(Word128_t* first_word, Word128_t* last_word);
extern "C" int win_x86_64_asm_revSequentialRead_Word128(Word128_t* last_word, Word128_t* first_word);
extern "C" int win_x86_64_asm_forwSequentialWrite_Word128(Word128_t* first_word, Word128_t* last_word);
extern "C" int win_x86_64_asm_revSequentialWrite_Word128(Word128_t* last_word, Word128_t* first_word);
#endif
#ifdef HAS_WORD_256
//256-bit
extern "C" int win_x86_64_asm_forwSequentialRead_Word256(Word256_t* first_word, Word256_t* last_word);
extern "C" int win_x86_64_asm_revSequentialRead_Word256(Word256_t* last_word, Word256_t* first_word);
extern "C" int win_x86_64_asm_forwSequentialWrite_Word256(Word256_t* first_word, Word256_t* last_word);
extern "C" int win_x86_64_asm_revSequentialWrite_Word256(Word256_t* last_word, Word256_t* first_word);
#endif
//Dummies
#ifdef HAS_WORD_128
//128-bit
extern "C" int win_x86_64_asm_dummy_forwSequentialLoop_Word128(Word128_t* first_word, Word128_t* last_word);
extern "C" int win_x86_64_asm_dummy_revSequentialLoop_Word128(Word128_t* first_word, Word128_t* last_word);
#endif
#ifdef HAS_WORD_256
//256-bit
extern "C" int win_x86_64_asm_dummy_forwSequentialLoop_Word256(Word256_t* first_word, Word256_t* last_word);
extern "C" int win_x86_64_asm_dummy_revSequentialLoop_Word256(Word256_t* first_word, Word256_t* last_word);
#endif
#endif
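For reference, this is the dispatch pattern the declarations above serve, condensed from the per-kernel functions later in this file (a sketch of existing code, lightly simplified):

```cpp
int32_t forwSequentialRead_Word256(void* start_address, void* end_address) {
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
    // VC++ x86-64 has no inline asm, so call the hand-coded assembler routine.
    return win_x86_64_asm_forwSequentialRead_Word256(
        static_cast<Word256_t*>(start_address),
        static_cast<Word256_t*>(end_address));
#endif
#ifdef __gnu_linux__
    // gcc path: the volatile pointer keeps the vector loads from being
    // optimized away.
    register Word256_t val;
    for (volatile Word256_t* wordptr = static_cast<Word256_t*>(start_address),
         *endptr = static_cast<Word256_t*>(end_address);
         wordptr < endptr; wordptr++)
        val = *wordptr;
    return 0;
#endif
}
```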
using namespace xmem;
@ -588,63 +630,15 @@ bool xmem::buildRandomPointerPermutation(void* start_address, void* end_address,
std::mt19937_64 gen(time(NULL)); //Mersenne Twister random number generator, seeded at current time
/*
//TODO: probably remove this.
//Build a random directed Hamiltonian Cycle across the memory region
//Let W be the list of memory locations that have not been reached yet. Each entry is an index in mem_base.
std::vector<size_t> W;
size_t w_index = 0;
//Initialize W to contain all memory locations, where each memory location appears exactly once in the list. The order does not strictly matter.
W.resize(num_pointers);
for (w_index = 0; w_index < num_pointers; w_index++) {
W.at(w_index) = w_index;
}
//Build the directed Hamiltonian Cycle
size_t v = 0; //the current memory location. Always start at the first location for the Hamiltonian Cycle construction
size_t w = 0; //the next memory location
w_index = 0;
while (W.size() > 0) { //while we have not reached all memory locations
W.erase(W.begin() + w_index);
//Normal case
if (W.size() > 0) {
//Choose the next w_index at random from W
w_index = gen() % W.size();
//Extract the memory location corresponding to this w_index
w = W[w_index];
} else { //Last element visited needs to point at head of memory to complete the cycle
w = 0;
}
//Create pointer v --> w. This corresponds to a directed edge in the graph with nodes v and w.
mem_region_base[v] = reinterpret_cast<uintptr_t>(mem_region_base + w);
//Chase this pointer to move to next step
v = w;
}
*/
//Do a random shuffle of memory pointers
#ifndef HAS_WORD_64 //special case for 32-bit architectures
Word32_t* mem_region_base = reinterpret_cast<Word32_t*>(start_address);
#endif
//Do a random shuffle of memory pointers.
//I had originally used a random Hamiltonian Cycle generator, but this was much slower and, aside from
//rare instances, did not make any difference in random-access performance measurement.
#ifdef HAS_WORD_64
Word64_t* mem_region_base = reinterpret_cast<Word64_t*>(start_address);
#else //special case for 32-bit architectures
Word32_t* mem_region_base = reinterpret_cast<Word32_t*>(start_address);
#endif
switch (chunk_size) {
//special case for 32-bit architectures
#ifndef HAS_WORD_64
case CHUNK_32b:
for (size_t i = 0; i < num_pointers; i++) { //Initialize pointers to point at themselves (identity mapping)
mem_region_base[i] = reinterpret_cast<Word32_t>(mem_region_base+i);
}
std::shuffle(mem_region_base, mem_region_base + num_pointers, gen);
break;
#endif
#ifdef HAS_WORD_64
case CHUNK_64b:
for (size_t i = 0; i < num_pointers; i++) { //Initialize pointers to point at themselves (identity mapping)
@ -652,19 +646,25 @@ bool xmem::buildRandomPointerPermutation(void* start_address, void* end_address,
}
std::shuffle(mem_region_base, mem_region_base + num_pointers, gen);
break;
#else //special case for 32-bit architectures
case CHUNK_32b:
for (size_t i = 0; i < num_pointers; i++) { //Initialize pointers to point at themselves (identity mapping)
mem_region_base[i] = reinterpret_cast<Word32_t>(mem_region_base+i);
}
std::shuffle(mem_region_base, mem_region_base + num_pointers, gen);
break;
#endif
#ifdef HAS_WORD_128
case CHUNK_128b:
for (size_t i = 0; i < num_pointers; i++) { //Initialize pointers to point at themselves (identity mapping)
#ifndef HAS_WORD_64 //special case for 32-bit architectures
#ifdef HAS_WORD_64
mem_region_base[i*2] = reinterpret_cast<Word64_t>(mem_region_base+(i*2));
mem_region_base[(i*2)+1] = 0xFFFFFFFFFFFFFFFF; //1-fill upper 64 bits
#else //special case for 32-bit architectures
mem_region_base[i*4] = reinterpret_cast<Word32_t>(mem_region_base+(i*4));
mem_region_base[(i*4)+1] = 0xFFFFFFFF; //1-fill upper 96 bits
mem_region_base[(i*4)+2] = 0xFFFFFFFF;
mem_region_base[(i*4)+3] = 0xFFFFFFFF;
#endif
#ifdef HAS_WORD_64
mem_region_base[i*2] = reinterpret_cast<Word64_t>(mem_region_base+(i*2));
mem_region_base[(i*2)+1] = 0xFFFFFFFFFFFFFFFF; //1-fill upper 64 bits
#endif
}
std::shuffle(reinterpret_cast<Word128_t*>(mem_region_base), reinterpret_cast<Word128_t*>(mem_region_base) + num_pointers, gen);
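A self-contained toy version of the identity-map-then-shuffle construction used above, 64-bit chunk case only (assumes a 64-bit build; X-Mem's real code also covers the 32-, 128-, and 256-bit chunk sizes):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const size_t num_pointers = 8;
    std::vector<uint64_t> mem(num_pointers);
    uint64_t* mem_region_base = mem.data();

    // Identity mapping: each word holds its own address...
    for (size_t i = 0; i < num_pointers; i++)
        mem_region_base[i] = reinterpret_cast<uint64_t>(mem_region_base + i);

    // ...then a random shuffle yields a permutation to pointer-chase.
    std::mt19937_64 gen(std::random_device{}());
    std::shuffle(mem_region_base, mem_region_base + num_pointers, gen);

    // Chase a few pointers to show the traversal.
    uintptr_t* p = reinterpret_cast<uintptr_t*>(mem_region_base);
    for (int i = 0; i < 4; i++) {
        printf("%p -> %p\n", (void*)p, (void*)*p);
        p = reinterpret_cast<uintptr_t*>(*p);
    }
    return 0;
}
```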
@ -673,7 +673,12 @@ bool xmem::buildRandomPointerPermutation(void* start_address, void* end_address,
#ifdef HAS_WORD_256
case CHUNK_256b:
for (size_t i = 0; i < num_pointers; i++) { //Initialize pointers to point at themselves (identity mapping)
#ifndef HAS_WORD_64 //special case for 32-bit architectures
#ifdef HAS_WORD_64
mem_region_base[i*4] = reinterpret_cast<Word64_t>(mem_region_base+(i*4));
mem_region_base[(i*4)+1] = 0xFFFFFFFFFFFFFFFF; //1-fill upper 192 bits
mem_region_base[(i*4)+2] = 0xFFFFFFFFFFFFFFFF;
mem_region_base[(i*4)+3] = 0xFFFFFFFFFFFFFFFF;
#else //special case for 32-bit architectures
mem_region_base[i*8] = reinterpret_cast<Word32_t>(mem_region_base+(i*8));
mem_region_base[(i*8)+1] = 0xFFFFFFFF; //1-fill upper 224 bits
mem_region_base[(i*8)+2] = 0xFFFFFFFF;
@ -682,12 +687,6 @@ bool xmem::buildRandomPointerPermutation(void* start_address, void* end_address,
mem_region_base[(i*8)+5] = 0xFFFFFFFF;
mem_region_base[(i*8)+6] = 0xFFFFFFFF;
mem_region_base[(i*8)+7] = 0xFFFFFFFF;
#endif
#ifdef HAS_WORD_64
mem_region_base[i*4] = reinterpret_cast<Word64_t>(mem_region_base+(i*4));
mem_region_base[(i*4)+1] = 0xFFFFFFFFFFFFFFFF; //1-fill upper 192 bits
mem_region_base[(i*4)+2] = 0xFFFFFFFFFFFFFFFF;
mem_region_base[(i*4)+3] = 0xFFFFFFFFFFFFFFFF;
#endif
}
std::shuffle(reinterpret_cast<Word256_t*>(mem_region_base), reinterpret_cast<Word256_t*>(mem_region_base) + num_pointers, gen);
@ -716,10 +715,6 @@ bool xmem::buildRandomPointerPermutation(void* start_address, void* end_address,
int32_t xmem::dummy_chasePointers(uintptr_t*, uintptr_t**, size_t len) {
volatile uintptr_t placeholder = 0; //Try to defeat compiler optimizations removing this method
#ifdef USE_SIZE_BASED_BENCHMARKS
for (size_t i = 0; i < len / sizeof(uintptr_t); i += 512)
placeholder = 0;
#endif
return 0;
}
@ -727,15 +722,7 @@ int32_t xmem::dummy_chasePointers(uintptr_t*, uintptr_t**, size_t len) {
int32_t xmem::chasePointers(uintptr_t* first_address, uintptr_t** last_touched_address, size_t len) {
volatile uintptr_t* p = first_address;
#ifdef USE_TIME_BASED_BENCHMARKS
UNROLL512(p = reinterpret_cast<uintptr_t*>(*p);)
#endif
#ifdef USE_SIZE_BASED_BENCHMARKS
for (size_t i = 0; i < len / sizeof(uintptr_t); i += 512) {
UNROLL512(p = reinterpret_cast<uintptr_t*>(*p);)
}
#endif
*last_touched_address = const_cast<uintptr_t*>(p);
return 0;
}
@ -747,45 +734,6 @@ int32_t xmem::chasePointers(uintptr_t* first_address, uintptr_t** last_touched_a
***********************************************************************
***********************************************************************/
#if defined(_WIN32) && defined(ARCH_INTEL)
//TODO: will we need ARM asm on Win?
//Hand-coded assembly functions for the SSE2/AVX benchmark routines.
//VC++ compiler does not support inline assembly in x86-64.
//And the compiler optimizes away the vector instructions unless I use volatile.
//But I can't use for example volatile Word256_t* because it is incompatible with _mm_load_si256() with VC++.
//Fortunately, I implemented the routine as a wrapper around a hand-coded assembler C function.
#ifdef HAS_WORD_128
//128-bit
extern "C" int win_asm_forwSequentialRead_Word128(Word128_t* first_word, Word128_t* last_word);
extern "C" int win_asm_revSequentialRead_Word128(Word128_t* last_word, Word128_t* first_word);
extern "C" int win_asm_forwSequentialWrite_Word128(Word128_t* first_word, Word128_t* last_word);
extern "C" int win_asm_revSequentialWrite_Word128(Word128_t* last_word, Word128_t* first_word);
#endif
#ifdef HAS_WORD_256
//256-bit
extern "C" int win_asm_forwSequentialRead_Word256(Word256_t* first_word, Word256_t* last_word);
extern "C" int win_asm_revSequentialRead_Word256(Word256_t* last_word, Word256_t* first_word);
extern "C" int win_asm_forwSequentialWrite_Word256(Word256_t* first_word, Word256_t* last_word);
extern "C" int win_asm_revSequentialWrite_Word256(Word256_t* last_word, Word256_t* first_word);
#endif
//Dummies
#ifdef HAS_WORD_128
//128-bit
extern "C" int win_asm_dummy_forwSequentialLoop_Word128(Word128_t* first_word, Word128_t* last_word);
extern "C" int win_asm_dummy_revSequentialLoop_Word128(Word128_t* first_word, Word128_t* last_word);
#endif
#ifdef HAS_WORD_256
//256-bit
extern "C" int win_asm_dummy_forwSequentialLoop_Word256(Word256_t* first_word, Word256_t* last_word);
extern "C" int win_asm_dummy_revSequentialLoop_Word256(Word256_t* first_word, Word256_t* last_word);
#endif
#endif
/* --------------------- DUMMY BENCHMARK ROUTINES --------------------------- */
int32_t xmem::dummy_empty(void*, void*) {
@ -816,8 +764,8 @@ int32_t xmem::dummy_forwSequentialLoop_Word64(void* start_address, void* end_add
#ifdef HAS_WORD_128
int32_t xmem::dummy_forwSequentialLoop_Word128(void* start_address, void* end_address) {
#ifdef _WIN32
return win_asm_dummy_forwSequentialLoop_Word128(static_cast<Word128_t*>(start_address), static_cast<Word128_t*>(end_address));
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return win_x86_64_asm_dummy_forwSequentialLoop_Word128(static_cast<Word128_t*>(start_address), static_cast<Word128_t*>(end_address));
#endif
#ifdef __gnu_linux__
volatile int32_t placeholder = 0; //Try our best to defeat compiler optimizations
@ -832,8 +780,8 @@ int32_t xmem::dummy_forwSequentialLoop_Word128(void* start_address, void* end_ad
#ifdef HAS_WORD_256
int32_t xmem::dummy_forwSequentialLoop_Word256(void* start_address, void* end_address) {
#ifdef _WIN32
return win_asm_dummy_forwSequentialLoop_Word256(static_cast<Word256_t*>(start_address), static_cast<Word256_t*>(end_address));
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return win_x86_64_asm_dummy_forwSequentialLoop_Word256(static_cast<Word256_t*>(start_address), static_cast<Word256_t*>(end_address));
#endif
#ifdef __gnu_linux__
volatile int32_t placeholder = 0; //Try our best to defeat compiler optimizations
@ -868,8 +816,8 @@ int32_t xmem::dummy_revSequentialLoop_Word64(void* start_address, void* end_addr
#ifdef HAS_WORD_128
int32_t xmem::dummy_revSequentialLoop_Word128(void* start_address, void* end_address) {
#ifdef _WIN32
return win_asm_dummy_revSequentialLoop_Word128(static_cast<Word128_t*>(end_address), static_cast<Word128_t*>(start_address));
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return win_x86_64_asm_dummy_revSequentialLoop_Word128(static_cast<Word128_t*>(end_address), static_cast<Word128_t*>(start_address));
#endif
volatile int32_t placeholder = 0; //Try our best to defeat compiler optimizations
for (volatile Word128_t* wordptr = static_cast<Word128_t*>(end_address), *begptr = static_cast<Word128_t*>(start_address); wordptr > begptr;) {
@ -882,8 +830,8 @@ int32_t xmem::dummy_revSequentialLoop_Word128(void* start_address, void* end_add
#ifdef HAS_WORD_256
int32_t xmem::dummy_revSequentialLoop_Word256(void* start_address, void* end_address) {
#ifdef _WIN32
return win_asm_dummy_revSequentialLoop_Word256(static_cast<Word256_t*>(end_address), static_cast<Word256_t*>(start_address));
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return win_x86_64_asm_dummy_revSequentialLoop_Word256(static_cast<Word256_t*>(end_address), static_cast<Word256_t*>(start_address));
#endif
#ifdef __gnu_linux__
volatile int32_t placeholder = 0; //Try our best to defeat compiler optimizations
@ -1341,12 +1289,6 @@ int32_t xmem::dummy_revStride16Loop_Word256(void* start_address, void* end_addre
#ifndef HAS_WORD_64 //special case: 32-bit architectures
int32_t xmem::dummy_randomLoop_Word32(uintptr_t*, uintptr_t**, size_t len) {
volatile uintptr_t* placeholder = NULL; //Try to defeat compiler optimizations removing this method
#ifdef USE_SIZE_BASED_BENCHMARKS
for (size_t i = 0; i < len / sizeof(uintptr_t); i += 1024)
placeholder = NULL;
#endif
return 0;
}
#endif
@ -1354,19 +1296,13 @@ int32_t xmem::dummy_randomLoop_Word32(uintptr_t*, uintptr_t**, size_t len) {
#ifdef HAS_WORD_64
int32_t xmem::dummy_randomLoop_Word64(uintptr_t*, uintptr_t**, size_t len) {
volatile uintptr_t* placeholder = NULL; //Try to defeat compiler optimizations removing this method
#ifdef USE_SIZE_BASED_BENCHMARKS
for (size_t i = 0; i < len / sizeof(uintptr_t); i += 512)
placeholder = NULL;
#endif
return 0;
}
#endif
#ifdef HAS_WORD_128
int32_t xmem::dummy_randomLoop_Word128(uintptr_t* first_address, uintptr_t** last_touched_address, size_t len) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -1376,24 +1312,12 @@ int32_t xmem::dummy_randomLoop_Word128(uintptr_t* first_address, uintptr_t** las
volatile uintptr_t val_extract;
val = my_64b_set_128b_word(0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF);
#ifdef USE_TIME_BASED_BENCHMARKS
#ifndef HAS_WORD_64 //special case: 32-bit machines
UNROLL256(val_extract = my_32b_extractLSB_128b(val);) //Extract 32 LSB.
#endif
#ifdef HAS_WORD_64
UNROLL256(val_extract = my_64b_extractLSB_128b(val);) //Extract 64 LSB.
#endif
#endif
#ifdef USE_SIZE_BASED_BENCHMARKS
for (size_t i = 0; i < len / sizeof(Word128_t); i += 256) {
#ifndef HAS_WORD_64 //special case: 32-bit machines
UNROLL256(val_extract = my_32b_extractLSB_128b(val);) //Extract 32 LSB.
#endif
#ifdef HAS_WORD_64
UNROLL256(val_extract = my_64b_extractLSB_128b(val);) //Extract 64 LSB.
#endif
}
#endif
return 0;
#endif
@ -1402,7 +1326,7 @@ int32_t xmem::dummy_randomLoop_Word128(uintptr_t* first_address, uintptr_t** las
#ifdef HAS_WORD_256
int32_t xmem::dummy_randomLoop_Word256(uintptr_t* first_address, uintptr_t** last_touched_address, size_t len) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -1412,24 +1336,12 @@ int32_t xmem::dummy_randomLoop_Word256(uintptr_t* first_address, uintptr_t** las
volatile uintptr_t val_extract;
val = my_64b_set_256b_word(0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF);
#ifdef USE_TIME_BASED_BENCHMARKS
#ifndef HAS_WORD_64 //special case: 32-bit machines
UNROLL128(val_extract = my_32b_extractLSB_256b(val);) //Extract 32 LSB.
#endif
#ifdef HAS_WORD_64
UNROLL128(val_extract = my_64b_extractLSB_256b(val);) //Extract 64 LSB.
#endif
#endif
#ifdef USE_SIZE_BASED_BENCHMARKS
for (size_t i = 0; i < len / sizeof(Word256_t); i += 128) {
#ifndef HAS_WORD_64 //special case: 32-bit machines
UNROLL128(val_extract = my_32b_extractLSB_256b(val);) //Extract 32 LSB.
#endif
#ifdef HAS_WORD_64
UNROLL128(val_extract = my_64b_extractLSB_256b(val);) //Extract 64 LSB.
#endif
}
#endif
return 0;
#endif
@ -1472,8 +1384,8 @@ int32_t xmem::forwSequentialRead_Word64(void* start_address, void* end_address)
#ifdef HAS_WORD_128
int32_t xmem::forwSequentialRead_Word128(void* start_address, void* end_address) {
#ifdef _WIN32
return win_asm_forwSequentialRead_Word128(static_cast<Word128_t*>(start_address), static_cast<Word128_t*>(end_address));
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return win_x86_64_asm_forwSequentialRead_Word128(static_cast<Word128_t*>(start_address), static_cast<Word128_t*>(end_address));
#endif
#ifdef __gnu_linux__
register Word128_t val;
@ -1487,8 +1399,8 @@ int32_t xmem::forwSequentialRead_Word128(void* start_address, void* end_address)
#ifdef HAS_WORD_256
int32_t xmem::forwSequentialRead_Word256(void* start_address, void* end_address) {
#ifdef _WIN32
return win_asm_forwSequentialRead_Word256(static_cast<Word256_t*>(start_address), static_cast<Word256_t*>(end_address));
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return win_x86_64_asm_forwSequentialRead_Word256(static_cast<Word256_t*>(start_address), static_cast<Word256_t*>(end_address));
#endif
#ifdef __gnu_linux__
register Word256_t val;
@ -1520,8 +1432,8 @@ int32_t xmem::revSequentialRead_Word64(void* start_address, void* end_address) {
#ifdef HAS_WORD_128
int32_t xmem::revSequentialRead_Word128(void* start_address, void* end_address) {
#ifdef _WIN32
return win_asm_revSequentialRead_Word128(static_cast<Word128_t*>(end_address), static_cast<Word128_t*>(start_address));
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return win_x86_64_asm_revSequentialRead_Word128(static_cast<Word128_t*>(end_address), static_cast<Word128_t*>(start_address));
#endif
#ifdef __gnu_linux__
register Word128_t val;
@ -1535,8 +1447,8 @@ int32_t xmem::revSequentialRead_Word128(void* start_address, void* end_address)
#ifdef HAS_WORD_256
int32_t xmem::revSequentialRead_Word256(void* start_address, void* end_address) {
#ifdef _WIN32
return win_asm_revSequentialRead_Word256(static_cast<Word256_t*>(end_address), static_cast<Word256_t*>(start_address));
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return win_x86_64_asm_revSequentialRead_Word256(static_cast<Word256_t*>(end_address), static_cast<Word256_t*>(start_address));
#endif
#ifdef __gnu_linux__
register Word256_t val;
@ -1570,8 +1482,8 @@ int32_t xmem::forwSequentialWrite_Word64(void* start_address, void* end_address)
#ifdef HAS_WORD_128
int32_t xmem::forwSequentialWrite_Word128(void* start_address, void* end_address) {
#ifdef _WIN32
return win_asm_forwSequentialWrite_Word128(static_cast<Word128_t*>(start_address), static_cast<Word128_t*>(end_address));
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return win_x86_64_asm_forwSequentialWrite_Word128(static_cast<Word128_t*>(start_address), static_cast<Word128_t*>(end_address));
#endif
#ifdef __gnu_linux__
register Word128_t val;
@ -1586,8 +1498,8 @@ int32_t xmem::forwSequentialWrite_Word128(void* start_address, void* end_address
#ifdef HAS_WORD_256
int32_t xmem::forwSequentialWrite_Word256(void* start_address, void* end_address) {
#ifdef _WIN32
return win_asm_forwSequentialWrite_Word256(static_cast<Word256_t*>(start_address), static_cast<Word256_t*>(end_address));
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return win_x86_64_asm_forwSequentialWrite_Word256(static_cast<Word256_t*>(start_address), static_cast<Word256_t*>(end_address));
#endif
#ifdef __gnu_linux__
register Word256_t val;
@ -1620,8 +1532,8 @@ int32_t xmem::revSequentialWrite_Word64(void* start_address, void* end_address)
#ifdef HAS_WORD_128
int32_t xmem::revSequentialWrite_Word128(void* start_address, void* end_address) {
#ifdef _WIN32
return win_asm_revSequentialWrite_Word128(static_cast<Word128_t*>(end_address), static_cast<Word128_t*>(start_address));
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return win_x86_64_asm_revSequentialWrite_Word128(static_cast<Word128_t*>(end_address), static_cast<Word128_t*>(start_address));
#endif
#ifdef __gnu_linux__
register Word128_t val;
@ -1636,8 +1548,8 @@ int32_t xmem::revSequentialWrite_Word128(void* start_address, void* end_address)
#ifdef HAS_WORD_256
int32_t xmem::revSequentialWrite_Word256(void* start_address, void* end_address) {
#ifdef _WIN32
return win_asm_revSequentialWrite_Word256(static_cast<Word256_t*>(end_address), static_cast<Word256_t*>(start_address));
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return win_x86_64_asm_revSequentialWrite_Word256(static_cast<Word256_t*>(end_address), static_cast<Word256_t*>(start_address));
#endif
#ifdef __gnu_linux__
register Word256_t val;
@ -1680,7 +1592,7 @@ int32_t xmem::forwStride2Read_Word64(void* start_address, void* end_address) {
#ifdef HAS_WORD_128
int32_t xmem::forwStride2Read_Word128(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -1699,7 +1611,7 @@ int32_t xmem::forwStride2Read_Word128(void* start_address, void* end_address) {
#ifdef HAS_WORD_256
int32_t xmem::forwStride2Read_Word256(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -1744,7 +1656,7 @@ int32_t xmem::revStride2Read_Word64(void* start_address, void* end_address) {
#ifdef HAS_WORD_128
int32_t xmem::revStride2Read_Word128(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -1763,7 +1675,7 @@ int32_t xmem::revStride2Read_Word128(void* start_address, void* end_address) {
#ifdef HAS_WORD_256
int32_t xmem::revStride2Read_Word256(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -1810,7 +1722,7 @@ int32_t xmem::forwStride2Write_Word64(void* start_address, void* end_address) {
#ifdef HAS_WORD_128
int32_t xmem::forwStride2Write_Word128(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -1830,7 +1742,7 @@ int32_t xmem::forwStride2Write_Word128(void* start_address, void* end_address) {
#ifdef HAS_WORD_256
int32_t xmem::forwStride2Write_Word256(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -1876,7 +1788,7 @@ int32_t xmem::revStride2Write_Word64(void* start_address, void* end_address) {
#ifdef HAS_WORD_128
int32_t xmem::revStride2Write_Word128(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -1896,7 +1808,7 @@ int32_t xmem::revStride2Write_Word128(void* start_address, void* end_address) {
#ifdef HAS_WORD_256
int32_t xmem::revStride2Write_Word256(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -1944,7 +1856,7 @@ int32_t xmem::forwStride4Read_Word64(void* start_address, void* end_address) {
#ifdef HAS_WORD_128
int32_t xmem::forwStride4Read_Word128(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -1963,7 +1875,7 @@ int32_t xmem::forwStride4Read_Word128(void* start_address, void* end_address) {
#ifdef HAS_WORD_256
int32_t xmem::forwStride4Read_Word256(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -2008,7 +1920,7 @@ int32_t xmem::revStride4Read_Word64(void* start_address, void* end_address) {
#ifdef HAS_WORD_128
int32_t xmem::revStride4Read_Word128(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -2027,7 +1939,7 @@ int32_t xmem::revStride4Read_Word128(void* start_address, void* end_address) {
#ifdef HAS_WORD_256
int32_t xmem::revStride4Read_Word256(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -2074,7 +1986,7 @@ int32_t xmem::forwStride4Write_Word64(void* start_address, void* end_address) {
#ifdef HAS_WORD_128
int32_t xmem::forwStride4Write_Word128(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -2094,7 +2006,7 @@ int32_t xmem::forwStride4Write_Word128(void* start_address, void* end_address) {
#ifdef HAS_WORD_256
int32_t xmem::forwStride4Write_Word256(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -2140,7 +2052,7 @@ int32_t xmem::revStride4Write_Word64(void* start_address, void* end_address) {
#ifdef HAS_WORD_128
int32_t xmem::revStride4Write_Word128(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -2160,7 +2072,7 @@ int32_t xmem::revStride4Write_Word128(void* start_address, void* end_address) {
#ifdef HAS_WORD_256
int32_t xmem::revStride4Write_Word256(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -2208,7 +2120,7 @@ int32_t xmem::forwStride8Read_Word64(void* start_address, void* end_address) {
#ifdef HAS_WORD_128
int32_t xmem::forwStride8Read_Word128(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -2227,7 +2139,7 @@ int32_t xmem::forwStride8Read_Word128(void* start_address, void* end_address) {
#ifdef HAS_WORD_256
int32_t xmem::forwStride8Read_Word256(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -2272,7 +2184,7 @@ int32_t xmem::revStride8Read_Word64(void* start_address, void* end_address) {
#ifdef HAS_WORD_128
int32_t xmem::revStride8Read_Word128(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -2291,7 +2203,7 @@ int32_t xmem::revStride8Read_Word128(void* start_address, void* end_address) {
#ifdef HAS_WORD_256
int32_t xmem::revStride8Read_Word256(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -2338,7 +2250,7 @@ int32_t xmem::forwStride8Write_Word64(void* start_address, void* end_address) {
#ifdef HAS_WORD_128
int32_t xmem::forwStride8Write_Word128(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -2358,7 +2270,7 @@ int32_t xmem::forwStride8Write_Word128(void* start_address, void* end_address) {
#ifdef HAS_WORD_256
int32_t xmem::forwStride8Write_Word256(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -2404,7 +2316,7 @@ int32_t xmem::revStride8Write_Word64(void* start_address, void* end_address) {
#ifdef HAS_WORD_128
int32_t xmem::revStride8Write_Word128(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -2424,7 +2336,7 @@ int32_t xmem::revStride8Write_Word128(void* start_address, void* end_address) {
#ifdef HAS_WORD_256
int32_t xmem::revStride8Write_Word256(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -2472,7 +2384,7 @@ int32_t xmem::forwStride16Read_Word64(void* start_address, void* end_address) {
#ifdef HAS_WORD_128
int32_t xmem::forwStride16Read_Word128(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -2491,7 +2403,7 @@ int32_t xmem::forwStride16Read_Word128(void* start_address, void* end_address) {
#ifdef HAS_WORD_256
int32_t xmem::forwStride16Read_Word256(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -2536,7 +2448,7 @@ int32_t xmem::revStride16Read_Word64(void* start_address, void* end_address) {
#ifdef HAS_WORD_128
int32_t xmem::revStride16Read_Word128(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -2555,7 +2467,7 @@ int32_t xmem::revStride16Read_Word128(void* start_address, void* end_address) {
#ifdef HAS_WORD_256
int32_t xmem::revStride16Read_Word256(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -2602,7 +2514,7 @@ int32_t xmem::forwStride16Write_Word64(void* start_address, void* end_address) {
#ifdef HAS_WORD_128
int32_t xmem::forwStride16Write_Word128(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0;
#endif
#ifdef __gnu_linux__
@ -2622,7 +2534,7 @@ int32_t xmem::forwStride16Write_Word128(void* start_address, void* end_address)
#ifdef HAS_WORD_256
int32_t xmem::forwStride16Write_Word256(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0;
#endif
#ifdef __gnu_linux__
@ -2670,7 +2582,7 @@ int32_t xmem::revStride16Write_Word64(void* start_address, void* end_address) {
#ifdef HAS_WORD_128
int32_t xmem::revStride16Write_Word128(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -2691,7 +2603,7 @@ int32_t xmem::revStride16Write_Word128(void* start_address, void* end_address) {
#ifdef HAS_WORD_256
int32_t xmem::revStride16Write_Word256(void* start_address, void* end_address) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
@ -2716,14 +2628,7 @@ int32_t xmem::revStride16Write_Word256(void* start_address, void* end_address) {
int32_t xmem::randomRead_Word32(uintptr_t* first_address, uintptr_t** last_touched_address, size_t len) {
volatile uintptr_t* p = first_address;
#ifdef USE_TIME_BASED_BENCHMARKS
UNROLL1024(p = reinterpret_cast<uintptr_t*>(*p);)
#endif
#ifdef USE_SIZE_BASED_BENCHMARKS
for (size_t i = 0; i < len / sizeof(uintptr_t); i += 1024) {
UNROLL1024(p = reinterpret_cast<uintptr_t*>(*p);)
}
#endif
*last_touched_address = const_cast<uintptr_t*>(p);
return 0;
}
@ -2733,14 +2638,7 @@ int32_t xmem::randomRead_Word32(uintptr_t* first_address, uintptr_t** last_touch
int32_t xmem::randomRead_Word64(uintptr_t* first_address, uintptr_t** last_touched_address, size_t len) {
volatile uintptr_t* p = first_address;
#ifdef USE_TIME_BASED_BENCHMARKS
UNROLL512(p = reinterpret_cast<uintptr_t*>(*p);)
#endif
#ifdef USE_SIZE_BASED_BENCHMARKS
for (size_t i = 0; i < len / sizeof(uintptr_t); i += 512) {
UNROLL512(p = reinterpret_cast<uintptr_t*>(*p);)
}
#endif
*last_touched_address = const_cast<uintptr_t*>(p);
return 0;
}
@ -2748,31 +2646,19 @@ int32_t xmem::randomRead_Word64(uintptr_t* first_address, uintptr_t** last_touch
#ifdef HAS_WORD_128
int32_t xmem::randomRead_Word128(uintptr_t* first_address, uintptr_t** last_touched_address, size_t len) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
volatile Word128_t* p = reinterpret_cast<Word128_t*>(first_address);
register Word128_t val;
#ifdef USE_TIME_BASED_BENCHMARKS
#ifndef HAS_WORD_64 //special case: 32-bit machine
UNROLL256(val = *p; p = reinterpret_cast<Word128_t*>(my_32b_extractLSB_128b(val));) //Do 128-bit load. Then extract 32 LSB to use as next load address.
#endif
#ifdef HAS_WORD_64
UNROLL256(val = *p; p = reinterpret_cast<Word128_t*>(my_64b_extractLSB_128b(val));) //Do 128-bit load. Then extract 64 LSB to use as next load address.
#endif
#endif
#ifdef USE_SIZE_BASED_BENCHMARKS
for (size_t i = 0; i < len / sizeof(Word128_t); i += 256) {
#ifndef HAS_WORD_64 //special case: 32-bit machine
UNROLL256(val = *p; p = reinterpret_cast<Word128_t*>(my_32b_extractLSB_128b(val));) //Do 128-bit load. Then extract 32 LSB to use as next load address.
#endif
#ifdef HAS_WORD_64
UNROLL256(val = *p; p = reinterpret_cast<Word128_t*>(my_64b_extractLSB_128b(val));) //Do 128-bit load. Then extract 64 LSB to use as next load address.
#endif
}
#endif
*last_touched_address = reinterpret_cast<uintptr_t*>(const_cast<Word128_t*>(p)); //Trick compiler. First get rid of volatile qualifier, and then reinterpret pointer
return 0;
@ -2782,31 +2668,19 @@ int32_t xmem::randomRead_Word128(uintptr_t* first_address, uintptr_t** last_touc
#ifdef HAS_WORD_256
int32_t xmem::randomRead_Word256(uintptr_t* first_address, uintptr_t** last_touched_address, size_t len) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
volatile Word256_t* p = reinterpret_cast<Word256_t*>(first_address);
register Word256_t val;
#ifdef USE_TIME_BASED_BENCHMARKS
#ifndef HAS_WORD_64 //special case: 32-bit machine
UNROLL128(val = *p; p = reinterpret_cast<Word256_t*>(my_32b_extractLSB_256b(val));) //Do 256-bit load. Then extract 32 LSB to use as next load address.
#endif
#ifdef HAS_WORD_64
UNROLL128(val = *p; p = reinterpret_cast<Word256_t*>(my_64b_extractLSB_256b(val));) //Do 256-bit load. Then extract 64 LSB to use as next load address.
#endif
#endif
#ifdef USE_SIZE_BASED_BENCHMARKS
for (size_t i = 0; i < len / sizeof(Word256_t); i += 128) {
#ifndef HAS_WORD_64 //special case: 32-bit machine
UNROLL128(val = *p; p = reinterpret_cast<Word256_t*>(my_32b_extractLSB_256b(val));) //Do 256-bit load. Then extract 32 LSB to use as next load address.
#endif
#ifdef HAS_WORD_64
UNROLL128(val = *p; p = reinterpret_cast<Word256_t*>(my_64b_extractLSB_256b(val));) //Do 256-bit load. Then extract 64 LSB to use as next load address.
#endif
}
#endif
*last_touched_address = reinterpret_cast<uintptr_t*>(const_cast<Word256_t*>(p)); //Trick compiler. First get rid of volatile qualifier, and then reinterpret pointer
return 0;
@ -2821,14 +2695,7 @@ int32_t xmem::randomWrite_Word32(uintptr_t* first_address, uintptr_t** last_touc
volatile uintptr_t* p = first_address;
volatile uintptr_t* p2 = NULL;
#ifdef USE_TIME_BASED_BENCHMARKS
UNROLL1024(p2 = reinterpret_cast<uintptr_t*>(*p); *p = reinterpret_cast<uintptr_t>(p2); p = p2;)
#endif
#ifdef USE_SIZE_BASED_BENCHMARKS
for (size_t i = 0; i < len / sizeof(uintptr_t); i += 1024) {
UNROLL1024(p2 = reinterpret_cast<uintptr_t*>(*p); *p = reinterpret_cast<uintptr_t>(p2); p = p2;)
}
#endif
*last_touched_address = const_cast<uintptr_t*>(p);
return 0;
}
@ -2839,14 +2706,7 @@ int32_t xmem::randomWrite_Word64(uintptr_t* first_address, uintptr_t** last_touc
volatile uintptr_t* p = first_address;
volatile uintptr_t* p2 = NULL;
#ifdef USE_TIME_BASED_BENCHMARKS
UNROLL512(p2 = reinterpret_cast<uintptr_t*>(*p); *p = reinterpret_cast<uintptr_t>(p2); p = p2;)
#endif
#ifdef USE_SIZE_BASED_BENCHMARKS
for (size_t i = 0; i < len / sizeof(uintptr_t); i += 512) {
UNROLL512(p2 = reinterpret_cast<uintptr_t*>(*p); *p = reinterpret_cast<uintptr_t>(p2); p = p2;)
}
#endif
*last_touched_address = const_cast<uintptr_t*>(p);
return 0;
}
@ -2854,31 +2714,19 @@ int32_t xmem::randomWrite_Word64(uintptr_t* first_address, uintptr_t** last_touc
#ifdef HAS_WORD_128
int32_t xmem::randomWrite_Word128(uintptr_t* first_address, uintptr_t** last_touched_address, size_t len) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
volatile Word128_t* p = reinterpret_cast<Word128_t*>(first_address);
register Word128_t val;
#ifdef USE_TIME_BASED_BENCHMARKS
#ifndef HAS_WORD_64 //special case: 32-bit machine
UNROLL256(val = *p; *p = val; p = reinterpret_cast<Word128_t*>(my_32b_extractLSB_128b(val));) //Do 128-bit load. Then do 128-bit store. Then extract 32 LSB to use as next load address.
#endif
#ifdef HAS_WORD_64
UNROLL256(val = *p; *p = val; p = reinterpret_cast<Word128_t*>(my_64b_extractLSB_128b(val));) //Do 128-bit load. Then do 128-bit store. Then extract 64 LSB to use as next load address.
#endif
#endif
#ifdef USE_SIZE_BASED_BENCHMARKS
for (size_t i = 0; i < len / sizeof(Word128_t); i += 256) {
#ifndef HAS_WORD_64 //special case: 32-bit machine
UNROLL256(val = *p; *p = val; p = reinterpret_cast<Word128_t*>(my_32b_extractLSB_128b(val));) //Do 128-bit load. Then do 128-bit store. Then extract 32 LSB to use as next load address.
#endif
#ifdef HAS_WORD_64
UNROLL256(val = *p; *p = val; p = reinterpret_cast<Word128_t*>(my_64b_extractLSB_128b(val));) //Do 128-bit load. Then do 128-bit store. Then extract 64 LSB to use as next load address.
#endif
}
#endif
*last_touched_address = reinterpret_cast<uintptr_t*>(const_cast<Word128_t*>(p)); //Trick compiler. First get rid of volatile qualifier, and then reinterpret pointer
return 0;
@ -2888,31 +2736,19 @@ int32_t xmem::randomWrite_Word128(uintptr_t* first_address, uintptr_t** last_tou
#ifdef HAS_WORD_256
int32_t xmem::randomWrite_Word256(uintptr_t* first_address, uintptr_t** last_touched_address, size_t len) {
#ifdef _WIN32
#if defined(_WIN32) && defined(ARCH_INTEL_X86_64)
return 0; //TODO: Implement for Windows.
#endif
#ifdef __gnu_linux__
volatile Word256_t* p = reinterpret_cast<Word256_t*>(first_address);
register Word256_t val;
#ifdef USE_TIME_BASED_BENCHMARKS
#ifndef HAS_WORD_64 //special case: 32-bit machine
UNROLL128(val = *p; *p = val; p = reinterpret_cast<Word256_t*>(my_32b_extractLSB_256b(val));) //Do 256-bit load. Then do 256-bit store. Then extract 32 LSB to use as next load address.
#endif
#ifdef HAS_WORD_64
UNROLL128(val = *p; *p = val; p = reinterpret_cast<Word256_t*>(my_64b_extractLSB_256b(val));) //Do 256-bit load. Then do 256-bit store. Then extract 64 LSB to use as next load address.
#endif
#endif
#ifdef USE_SIZE_BASED_BENCHMARKS
for (size_t i = 0; i < len / sizeof(Word256_t); i += 128) {
#ifndef HAS_WORD_64 //special case: 32-bit machine
UNROLL128(val = *p; *p = val; p = reinterpret_cast<Word256_t*>(my_32b_extractLSB_256b(val));) //Do 256-bit load. Then do 256-bit store. Then extract 32 LSB to use as next load address.
#endif
#ifdef HAS_WORD_64
UNROLL128(val = *p; *p = val; p = reinterpret_cast<Word256_t*>(my_64b_extractLSB_256b(val));) //Do 256-bit load. Then do 256-bit store. Then extract 64 LSB to use as next load address.
#endif
}
#endif
*last_touched_address = reinterpret_cast<uintptr_t*>(const_cast<Word256_t*>(p)); //Trick compiler. First get rid of volatile qualifier, and then reinterpret pointer
return 0;
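The 128-bit and 256-bit variants cannot dereference a vector register directly, so each step loads a full vector and then extracts the 32 or 64 least-significant bits to form the next pointer. The my_*_extractLSB_* helpers presumably wrap the corresponding intrinsics; a sketch of the 64-bit/SSE2 case (assumed shape, not the verbatim source):

    #include <emmintrin.h> //SSE2 intrinsics
    #include <cstdint>

    typedef __m128i Word128_t;

    //Extract the 64 LSB of a 128-bit word so they can be reinterpreted
    //as the next load address in the pointer chase.
    static inline uint64_t my_64b_extractLSB_128b(Word128_t val) {
        return static_cast<uint64_t>(_mm_cvtsi128_si64(val));
    }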

View file

@ -147,9 +147,15 @@ void xmem::print_compile_time_options() {
#ifdef ARCH_INTEL_X86_64
std::cout << "ARCH_INTEL_X86_64" << std::endl;
#endif
#ifdef ARCH_INTEL_X86_64_SSE
std::cout << "ARCH_INTEL_X86_64_SSE" << std::endl;
#endif
#ifdef ARCH_INTEL_X86_64_SSE2
std::cout << "ARCH_INTEL_X86_64_SSE2" << std::endl;
#endif
#ifdef ARCH_INTEL_X86_64_SSE3
std::cout << "ARCH_INTEL_X86_64_SSE3" << std::endl;
#endif
#ifdef ARCH_INTEL_X86_64_AVX
std::cout << "ARCH_INTEL_X86_64_AVX" << std::endl;
#endif
@ -162,6 +168,24 @@ void xmem::print_compile_time_options() {
#ifdef ARCH_ARM
std::cout << "ARCH_ARM" << std::endl;
#endif
#ifdef ARCH_ARM_64
std::cout << "ARCH_ARM_64" << std::endl;
#endif
#ifdef ARCH_ARM_V7
std::cout << "ARCH_ARM_V7" << std::endl;
#endif
#ifdef ARCH_ARM_V8
std::cout << "ARCH_ARM_V8" << std::endl;
#endif
#ifdef ARCH_ARM_VFP_V3
std::cout << "ARCH_ARM_VFP_V3" << std::endl;
#endif
#ifdef ARCH_ARM_VFP_V4
std::cout << "ARCH_ARM_VFP_V4" << std::endl;
#endif
#ifdef ARCH_ARM_NEON
std::cout << "ARCH_ARM_NEON" << std::endl;
#endif
#ifdef ARCH_64BIT
std::cout << "ARCH_64BIT" << std::endl;
#endif
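As an illustration (assuming an x86-64 GNU/Linux build compiled with AVX support), this portion of the compile-time report would now be expected to include lines such as:

    ARCH_INTEL_X86_64
    ARCH_INTEL_X86_64_SSE
    ARCH_INTEL_X86_64_SSE2
    ARCH_INTEL_X86_64_SSE3
    ARCH_INTEL_X86_64_AVX
    ARCH_64BIT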
@ -198,24 +222,12 @@ void xmem::print_compile_time_options() {
std::cout << "USE_TSC_TIMER" << std::endl;
#endif
//TODO: ARM timer
#ifdef USE_TIME_BASED_BENCHMARKS
std::cout << "USE_TIME_BASED_BENCHMARKS" << std::endl;
#endif
#ifdef BENCHMARK_DURATION_MS
std::cout << "BENCHMARK_DURATION_MS = " << BENCHMARK_DURATION_MS << std::endl; //This must be defined
#endif
#ifdef THROUGHPUT_BENCHMARK_BYTES_PER_PASS
std::cout << "THROUGHPUT_BENCHMARK_BYTES_PER_PASS == " << THROUGHPUT_BENCHMARK_BYTES_PER_PASS << std::endl;
#endif
#ifdef USE_SIZE_BASED_BENCHMARKS
std::cout << "USE_SIZE_BASED_BENCHMARKS" << std::endl;
#endif
#ifdef USE_PASSES_CURVE_1
std::cout << "USE_PASSES_CURVE_1" << std::endl;
#endif
#ifdef USE_PASSES_CURVE_2
std::cout << "USE_PASSES_CURVE_2" << std::endl;
#endif
#ifdef POWER_SAMPLING_PERIOD_MS
std::cout << "POWER_SAMPLING_PERIOD_MS == " << POWER_SAMPLING_PERIOD_MS << std::endl;
#endif
@ -379,20 +391,6 @@ int32_t xmem::cpu_id_in_numa_node(uint32_t numa_node, uint32_t cpu_in_node) {
#endif
}
size_t xmem::compute_number_of_passes(size_t working_set_size_KB) {
size_t passes = 0;
#ifdef USE_PASSES_CURVE_1
passes = 65536 / working_set_size_KB;
#else
#ifdef USE_PASSES_CURVE_2
passes = (4*2097152) / working_set_size_KB^2;
#endif
#endif
if (passes < 1)
passes = 1;
return passes;
}
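One more reason this removal is welcome: in C++, ^ is bitwise XOR, not exponentiation, and it binds more loosely than /, so the curve-2 expression above actually computed ((4*2097152) / working_set_size_KB) ^ 2 rather than the documented y = 4*2097152 / working_set_size_KB^2. Had the feature been kept, the intended formula would read:

    passes = (4 * 2097152) / (working_set_size_KB * working_set_size_KB);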
void xmem::init_globals() {
//Initialize global variables to defaults.
g_verbose = false;

View file

@ -56,9 +56,6 @@ DelayInjectedLoadedLatencyBenchmark::DelayInjectedLoadedLatencyBenchmark(
void* mem_array,
size_t len,
uint32_t iterations,
#ifdef USE_SIZE_BASED_BENCHMARKS
uint32_t passes_per_iteration,
#endif
uint32_t num_worker_threads,
uint32_t mem_node,
uint32_t cpu_node,
@ -71,9 +68,6 @@ DelayInjectedLoadedLatencyBenchmark::DelayInjectedLoadedLatencyBenchmark(
mem_array,
len,
iterations,
#ifdef USE_SIZE_BASED_BENCHMARKS
passes_per_iteration,
#endif
num_worker_threads,
mem_node,
cpu_node,
@ -277,18 +271,12 @@ bool DelayInjectedLoadedLatencyBenchmark::_run_core() {
if (t == 0) { //special case: thread 0 is always latency thread
workers.push_back(new LatencyWorker(thread_mem_array,
len_per_thread,
#ifdef USE_SIZE_BASED_BENCHMARKS
_passes_per_iteration,
#endif
lat_kernel_fptr,
lat_kernel_dummy_fptr,
cpu_id));
} else {
workers.push_back(new LoadWorker(thread_mem_array,
len_per_thread,
#ifdef USE_SIZE_BASED_BENCHMARKS
_passes_per_iteration,
#endif
load_kernel_fptr,
load_kernel_dummy_fptr,
cpu_id));
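A side benefit of dropping passes_per_iteration: the worker constructor signatures no longer vary with compile-time options, so every call site now has the same shape. For example (identifiers as in the surrounding code):

    //Thread 0 measures latency; the remaining threads generate load.
    workers.push_back(new LatencyWorker(thread_mem_array, len_per_thread,
                                        lat_kernel_fptr, lat_kernel_dummy_fptr, cpu_id));
    workers.push_back(new LoadWorker(thread_mem_array, len_per_thread,
                                     load_kernel_fptr, load_kernel_dummy_fptr, cpu_id));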

View file

@ -61,7 +61,6 @@ namespace xmem {
* @param mem_array A pointer to a contiguous chunk of memory that has been allocated for benchmarking among potentially several worker threads. This should be aligned to a 256-bit boundary.
* @param len Length of mem_array in bytes. This must be a multiple of 4 KB and should be at least the per-thread working set size times the number of worker threads.
* @param iterations Number of iterations of the complete benchmark. Used to average results and provide a measure of consistency and reproducibility.
* @param passes_per_iteration Number of passes to do in each iteration, to ensure the timed section of code is "long enough".
* @param num_worker_threads The number of worker threads to use in the benchmark.
* @param mem_node The logical memory NUMA node used in the benchmark.
* @param cpu_node The logical CPU NUMA node to use for the benchmark.
@ -76,9 +75,6 @@ namespace xmem {
void* mem_array,
size_t len,
uint32_t iterations,
#ifdef USE_SIZE_BASED_BENCHMARKS
uint32_t passes_per_iteration,
#endif
uint32_t num_worker_threads,
uint32_t mem_node,
uint32_t cpu_node,
@ -172,14 +168,6 @@ namespace xmem {
*/
uint32_t getIterations() const;
#ifdef USE_SIZE_BASED_BENCHMARKS
/**
* @brief Gets the number of passes in each iteration.
* @returns The number of passes per iteration for this benchmark.
*/
uint32_t getPassesPerIteration() const;
#endif
/**
* @brief Gets the width of memory access used in this benchmark.
* @returns The chunk size for this benchmark.
@ -256,9 +244,6 @@ namespace xmem {
//Benchmark repetition
uint32_t _iterations; /**< Number of iterations used in this benchmark. */
#ifdef USE_SIZE_BASED_BENCHMARKS
uint32_t _passes_per_iteration; /**< Number of passes per iteration in this benchmark. */
#endif
//Threading and affinity
uint32_t _num_worker_threads; /**< The number of worker threads used in this benchmark. */

View file

@ -53,9 +53,6 @@ namespace xmem {
void* mem_array,
size_t len,
uint32_t iterations,
#ifdef USE_SIZE_BASED_BENCHMARKS
uint32_t passes_per_iteration,
#endif
uint32_t num_worker_threads,
uint32_t mem_node,
uint32_t cpu_node,

View file

@ -53,9 +53,6 @@ namespace xmem {
LatencyWorker(
void* mem_array,
size_t len,
#ifdef USE_SIZE_BASED_BENCHMARKS
uint32_t passes_per_iteration,
#endif
RandomFunction kernel_fptr,
RandomFunction kernel_dummy_fptr,
int32_t cpu_affinity

View file

@ -52,9 +52,6 @@ namespace xmem {
LoadWorker(
void* mem_array,
size_t len,
#ifdef USE_SIZE_BASED_BENCHMARKS
uint32_t passes_per_iteration,
#endif
SequentialFunction kernel_fptr,
SequentialFunction kernel_dummy_fptr,
int32_t cpu_affinity
@ -71,9 +68,6 @@ namespace xmem {
LoadWorker(
void* mem_array,
size_t len,
#ifdef USE_SIZE_BASED_BENCHMARKS
uint32_t passes_per_iteration,
#endif
RandomFunction kernel_fptr,
RandomFunction kernel_dummy_fptr,
int32_t cpu_affinity

View file

@ -48,15 +48,11 @@ namespace xmem {
* @brief Constructor.
* @param mem_array Pointer to the memory region to use by this worker.
* @param len Length of the memory region to use by this worker.
* @param passes_per_iteration for size-based benchmarking, this is the number of passes to execute in a single benchmark iteration.
* @param cpu_affinity Logical CPU identifier to lock this worker's thread to.
*/
MemoryWorker(
void* mem_array,
size_t len,
#ifdef USE_SIZE_BASED_BENCHMARKS
uint32_t passes_per_iteration,
#endif
int32_t cpu_affinity
);
@ -124,9 +120,6 @@ namespace xmem {
tick_t _adjusted_ticks; /**< Elapsed ticks minus dummy elapsed ticks. */
bool _warning; /**< If true, results may be suspect. */
bool _completed; /**< If true, worker completed. */
#ifdef USE_SIZE_BASED_BENCHMARKS
uint32_t _passes_per_iteration; /**< Number of passes per iteration. */
#endif
};
};

View file

@ -52,9 +52,6 @@ namespace xmem {
void* mem_array,
size_t len,
uint32_t iterations,
#ifdef USE_SIZE_BASED_BENCHMARKS
uint32_t passes_per_iteration,
#endif
uint32_t num_worker_threads,
uint32_t mem_node,
uint32_t cpu_node,

View file

@ -45,7 +45,7 @@
namespace xmem {
#define VERSION "2.1.13"
#define VERSION "2.1.14"
#if !defined(_WIN32) && !defined(__gnu_linux__)
#error Neither Windows/GNULinux build environments were detected!
@ -55,43 +55,58 @@ namespace xmem {
#ifdef _WIN32
#ifdef _M_IX86 //Intel x86
#define ARCH_INTEL_X86
#define ARCH_INTEL
#define ARCH_INTEL_X86
#endif
#ifdef _M_X64 //Intel x86-64
#define ARCH_INTEL_X86_64
#define ARCH_INTEL
#define ARCH_INTEL_X86_64
#define ARCH_64BIT
#define HAS_NUMA
#endif
#ifdef _M_IX86_FP //Intel x86-64 SSE2 extensions
#define ARCH_INTEL_X86_64_SSE2
#define ARCH_INTEL
#if _M_IX86_FP == 1
#define ARCH_INTEL_X86_64_SSE
#endif
#if _M_IX86_FP == 2
#define ARCH_INTEL_X86_64_SSE
#define ARCH_INTEL_X86_64_SSE2
#endif
#endif
#ifdef __AVX__ //Intel x86-64 AVX extensions
#define ARCH_INTEL_X86_64_AVX
#define ARCH_INTEL
#define ARCH_INTEL_X86_64_AVX
#endif
#ifdef __AVX2__ //Intel x86-64 AVX2 extensions
#define ARCH_INTEL_X86_64_AVX2
#define ARCH_INTEL
#define ARCH_INTEL_X86_64_AVX2
#endif
#ifdef _AMD64 //AMD64
#ifdef _M_AMD64 //AMD64
#define ARCH_INTEL
#define ARCH_AMD64
#define ARCH_64BIT
#define ARCH_INTEL
#endif
#ifdef _M_ARM //ARM architecture
#define ARCH_ARM
#define ARCH_ARM_NEON //FIXME: there appears to be no way to explicitly check for NEON support on Windows, so assume it is present on any Windows-supported ARM platform.
#endif
//TODO: ARM 64-bit support
#ifdef _M_ARM_FP //ARM extensions
#if _M_ARM_FP >= 30 && _M_ARM_FP <= 39
#define ARCH_ARM_VFP_V3
#endif
#if _M_ARM_FP >= 40 && _M_ARM_FP <= 49
#define ARCH_ARM_VFP_V3
#define ARCH_ARM_VFP_V4
#endif
#endif
#endif
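For reference, MSVC encodes the /arch setting in these macros: _M_IX86_FP is 1 under /arch:SSE and 2 under /arch:SSE2 (and is defined only when targeting 32-bit x86), while _M_ARM_FP falls in 30-39 for the VFPv3 default and 40-49 under /arch:VFPv4. In summary (per the MSVC documentation):

    //MSVC flag            predefined macro          X-Mem macro(s) set above
    //  /arch:SSE       -> _M_IX86_FP == 1       -> ARCH_INTEL_X86_64_SSE
    //  /arch:SSE2      -> _M_IX86_FP == 2       -> ARCH_INTEL_X86_64_SSE, ..._SSE2
    //  (VFPv3 default) -> _M_ARM_FP in 30..39   -> ARCH_ARM_VFP_V3
    //  /arch:VFPv4     -> _M_ARM_FP in 40..49   -> ARCH_ARM_VFP_V3, ARCH_ARM_VFP_V4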
@ -99,42 +114,78 @@ namespace xmem {
#ifdef __gnu_linux__
#ifdef __i386__ //Intel x86
#define ARCH_INTEL_X86
#define ARCH_INTEL
#define ARCH_INTEL_X86
#endif
#ifdef __x86_64__ //Intel x86-64
#define ARCH_INTEL
#define ARCH_INTEL_X86_64
#define ARCH_64BIT
#define ARCH_INTEL
#define HAS_NUMA
#endif
#ifdef __SSE2__ //Intel x86-64 SSE2 extensions
#define ARCH_INTEL_X86_64_SSE2
#ifdef __SSE__ //Intel x86-64 SSE extensions
#define ARCH_INTEL
#define ARCH_INTEL_X86_64_SSE
#endif
#ifdef __SSE2__ //Intel x86-64 SSE2 extensions
#define ARCH_INTEL
#define ARCH_INTEL_X86_64_SSE2
#endif
#ifdef __SSE3__ //Intel x86-64 SSE3 extensions
#define ARCH_INTEL
#define ARCH_INTEL_X86_64_SSE3
#endif
#ifdef __AVX__ //Intel x86-64 AVX extensions
#define ARCH_INTEL_X86_64_AVX
#define ARCH_INTEL
#define ARCH_INTEL_X86_64_AVX
#endif
#ifdef __AVX2__ //Intel x86-64 AVX2 extensions
#define ARCH_INTEL_X86_64_AVX2
#define ARCH_INTEL
#define ARCH_INTEL_X86_64_AVX2
#endif
#ifdef __amd64__ //AMD64
#define ARCH_INTEL
#define ARCH_AMD64
#define ARCH_64BIT
#endif
#ifdef __arm__ //ARM architecture
#define ARCH_ARM
#define ARCH_ARM_VFP_V3 //FIXME: this is assumed, as I don't know how to check directly
#endif
//TODO: ARM 64-bit support
#ifdef __aarch64__ //ARM 64-bit
#define ARCH_ARM
#define ARCH_ARM_64
#define ARCH_ARM_VFP_V4 //FIXME: this is assumed, as I don't know how to check directly
#define ARCH_64BIT
#endif
#ifdef __ARM_ARCH_7__ //ARMv7
#define ARCH_ARM
#define ARCH_ARM_V7
#endif
#ifdef __ARM_ARCH_8__ //ARMv8
#define ARCH_ARM
#define ARCH_ARM_V8
#define ARCH_ARM_64
#define ARCH_ARM_VFP_V4 //FIXME: this is assumed, as I don't know how to check directly
#define ARCH_64BIT
#endif
#ifdef __ARM_NEON__ //ARM NEON extensions
#define ARCH_ARM
#define ARCH_ARM_NEON
#define ARCH_ARM_VFP_V4 //FIXME: this is assumed, as I don't know how to check directly
#endif
#endif
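These ARCH_* capability macros exist so that later sections of common.h can derive the generic word-size flags (HAS_WORD_128, HAS_WORD_256, and so on) that benchmark_kernels.cpp keys off, without naming any particular ISA. Presumably the derivation looks something like this (a sketch, not the verbatim source):

    #ifdef ARCH_INTEL_X86_64_SSE2
    #define HAS_WORD_128 //128-bit vector loads/stores are available
    #endif
    #ifdef ARCH_INTEL_X86_64_AVX
    #define HAS_WORD_256 //256-bit vector loads/stores are available
    #endif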
@ -204,19 +255,8 @@ namespace xmem {
#define USE_OS_TIMER /**< RECOMMENDED ENABLED. If enabled, uses the QPC timer on Windows and the POSIX clock_gettime() on GNU/Linux for all timing purposes. */
//#define USE_HW_TIMER /**< RECOMMENDED DISABLED. If enabled, uses the platform-specific hardware timer (e.g., TSC on Intel x86-64). This may be less portable or have other implementation-specific quirks but for most purposes should work fine. */
//Benchmarking methodology. Only one may be selected!
#define USE_TIME_BASED_BENCHMARKS /**< RECOMMENDED ENABLED. All benchmarks run for an estimated amount of time, and the figures of merit are computed based on the amount of memory accesses completed in the time limit. This mode has more consistent runtime across different machines, memory performance, and working set sizes, but may have more conservative measurements for differing levels of cache hierarchy (overestimating latency and underestimating throughput). */
//#define USE_SIZE_BASED_BENCHMARKS /**< RECOMMENDED DISABLED. All benchmarks run for an estimated amount of memory accesses, and the figures of merit are computed based on the length of time required to run the benchmark. This mode may have highly varying runtime across different machines, memory performance, and working set sizes, but may have more optimistic measurements across differing levels of cache hierarchy (underestimating latency and overestimating throughput). TODO: remove this feature entirely at some point, it just complicates things... */
#ifdef USE_TIME_BASED_BENCHMARKS //DO NOT COMMENT THIS OUT!
#define BENCHMARK_DURATION_MS 250 /**< RECOMMENDED VALUE: At least 1000. Number of milliseconds to run in each benchmark. */
#define THROUGHPUT_BENCHMARK_BYTES_PER_PASS 4096 /**< RECOMMENDED VALUE: 4096. Number of bytes read or written per pass of any ThroughputBenchmark. This must be less than or equal to the minimum working set size, which is currently 4 KB. */
#endif //DO NOT COMMENT THIS OUT
#ifdef USE_SIZE_BASED_BENCHMARKS //DO NOT COMMENT THIS OUT
//#define USE_PASSES_CURVE_1 /**< RECOMMENDED DISABLED. The passes per iteration of a benchmark will be given by y = 65536 / working_set_size_KB */
#define USE_PASSES_CURVE_2 /**< RECOMMENDED ENABLED. The passes per iteration of a benchmark will be given by y = 4*2097152 / working_set_size_KB^2 */
#endif //DO NOT COMMENT THIS OUT
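To make the time-based methodology concrete: each throughput pass touches THROUGHPUT_BENCHMARK_BYTES_PER_PASS bytes, passes are repeated until roughly BENCHMARK_DURATION_MS has elapsed, and the figure of merit is bytes moved per unit time. For example, a run that completed 2,000,000 passes in the 250 ms window would report roughly:

    throughput = (passes * THROUGHPUT_BENCHMARK_BYTES_PER_PASS) / elapsed_time
               = (2,000,000 * 4096 B) / 0.25 s
               ≈ 32.8 GB/s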
#define POWER_SAMPLING_PERIOD_MS 1000 /**< RECOMMENDED VALUE: 1000. Sampling period in milliseconds for all power measurement mechanisms. */
@ -254,32 +294,13 @@ namespace xmem {
#error Only one type of timer may be defined!
#endif
//Compile-time options checks: benchmarking type: size or time-limited
#if (defined(USE_TIME_BASED_BENCHMARKS) && defined(USE_SIZE_BASED_BENCHMARKS)) || (!defined(USE_TIME_BASED_BENCHMARKS) && !defined(USE_SIZE_BASED_BENCHMARKS))
#error Exactly one of USE_TIME_BASED_BENCHMARKS and USE_SIZE_BASED_BENCHMARKS must be defined!
#endif
#ifdef USE_TIME_BASED_BENCHMARKS
#ifndef BENCHMARK_DURATION_MS
#error BENCHMARK_DURATION_MS must be defined!
#endif
#if BENCHMARK_DURATION_MS <= 0
#error BENCHMARK_DURATION_MS must be positive!
#endif
#ifndef THROUGHPUT_BENCHMARK_BYTES_PER_PASS
#error THROUGHPUT_BENCHMARK_BYTES_PER_PASS must be defined!
#else
#if THROUGHPUT_BENCHMARK_BYTES_PER_PASS > DEFAULT_PAGE_SIZE || THROUGHPUT_BENCHMARK_BYTES_PER_PASS <= 0
#error THROUGHPUT_BENCHMARK_BYTES_PER_PASS must be less than or equal to the minimum possible working set size. It also must be a positive integer.
#endif
#endif
#endif
#ifdef USE_SIZE_BASED_BENCHMARKS
#if (defined(USE_PASSES_CURVE_1) && defined(USE_PASSES_CURVE_2)) || (!defined(USE_PASSES_CURVE_1) && !defined(USE_PASSES_CURVE_2))
#error Exactly one passes curve must be defined.
#endif
#endif
//Compile-time options checks: power sampling frequency. TODO: this should probably be a runtime option
#if !defined(POWER_SAMPLING_PERIOD_MS) || POWER_SAMPLING_PERIOD_MS <= 0

View file

@ -55,9 +55,6 @@ namespace xmem {
void* mem_array,
size_t len,
uint32_t iterations,
#ifdef USE_SIZE_BASED_BENCHMARKS
uint32_t passes_per_iteration,
#endif
uint32_t num_worker_threads,
uint32_t mem_node,
uint32_t cpu_node,