We're rolling out a new major update to AIDA64 in a few weeks. It will feature the usual improvements to support the latest and greatest hardware technologies, such as GPU details for AMD Radeon R5, R7 and R9 Series and nVIDIA GeForce GTX 760 Ti OEM, and optimized benchmarks for AMD Kaveri and Intel Bay Trail.
But most importantly, we're introducing a brand new benchmark panel that offers a set of OpenCL GPGPU benchmarks that you can launch from AIDA64 / main menu / Tools / GPGPU Benchmarks. These benchmarks are designed to measure GPGPU computing performance via different OpenCL workloads. Every benchmark method is designed to work on up to 16 GPUs, including AMD, Intel and nVIDIA GPUs, in any combination. Of course, CrossFire and SLI configurations, as well as both dGPUs and APUs, are fully supported. HSA configurations are handled via preliminary support. Basically, any compute-capable device that appears as a GPU device among the OpenCL devices will be benchmarked.
The OpenCL benchmark methods currently offered are not specifically optimized for any GPU architecture. Instead, the AIDA64 OpenCL module relies on the OpenCL compiler to optimize the OpenCL kernel to run best on the underlying hardware. The OpenCL kernels used for these benchmarks are compiled at run time, using the OpenCL driver that the OpenCL GPU device belongs to. Because of that approach, it is always best to have all video drivers (Catalyst, ForceWare, HD Graphics, etc.) updated to their latest version. For compilation, the following OpenCL compiler options are passed: -cl-fast-relaxed-math -cl-mad-enable.
On top of that, the GPGPU Benchmark Panel also has a CPU column for comparison purposes. The CPU measurements, however, are not obtained via OpenCL, but using native x86/x64 machine code that utilizes available instruction set extensions like SSE, AVX, AVX2, FMA and XOP. The CPU benchmarks are very similar to the existing CPU and FPU benchmarks of AIDA64, but this time they measure maximum computing rates (FLOPS, IOPS). The CPU benchmarks are heavily multi-threaded, and are optimized for every CPU architecture introduced since the first Pentium came out.
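For anyone who wants to see what passing those two compiler options looks like in practice, here is a minimal sketch using the third-party pyopencl package. This is not AIDA64's code: the kernel, the names KERNEL_SOURCE, BUILD_OPTIONS and compile_kernel, and the choice of pyopencl are all assumptions made for illustration. Only the two option strings come from the post.

```python
# Hypothetical illustration only; AIDA64's real kernels are not public.
# A tiny OpenCL C kernel that chains multiply-add operations.
KERNEL_SOURCE = """
__kernel void mad_chain(__global float *out, const float a, const float b)
{
    float x = (float)get_global_id(0);
    for (int i = 0; i < 1024; ++i) {
        x = mad(x, a, b);   /* OpenCL built-in: x * a + b */
    }
    out[get_global_id(0)] = x;
}
"""

# The two compiler options quoted in the post.
BUILD_OPTIONS = ("-cl-fast-relaxed-math", "-cl-mad-enable")


def compile_kernel(source=KERNEL_SOURCE, options=BUILD_OPTIONS):
    """Compile the kernel at run time with the driver's own OpenCL compiler.

    pyopencl is imported lazily so this file still loads on machines
    without an OpenCL runtime installed.
    """
    import pyopencl as cl  # third-party package, assumed to be installed

    ctx = cl.create_some_context(interactive=False)
    return cl.Program(ctx, source).build(options=list(options))
```

Calling compile_kernel() needs a working OpenCL driver. Because the driver does the compiling, the same source can produce different machine code after a driver update, which is why keeping the video drivers current matters.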
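As a rough illustration of how a rate such as FLOPS or IOPS can be derived from a timed run, here is a small Python sketch. It assumes the conventional accounting where one multiply-add counts as two operations; the post does not say how AIDA64 counts them. The function names are made up for this example, and an interpreted Python loop will of course report rates far below what native machine code achieves.

```python
import time


def ops_per_second(mad_count, seconds, ops_per_mad=2):
    """Convert a count of multiply-add steps and a duration into a rate.

    ops_per_mad=2 is an assumption: one multiply plus one add.
    """
    return mad_count * ops_per_mad / seconds


def timed_mad_loop(mad_count=200_000, a=0.999999, b=0.5):
    """Time a chain of dependent multiply-add steps and return (rate, result)."""
    x = 1.0
    start = time.perf_counter()
    for _ in range(mad_count):
        x = x * a + b  # one multiply-add step
    elapsed = time.perf_counter() - start
    return ops_per_second(mad_count, elapsed), x


if __name__ == "__main__":
    rate, _ = timed_mad_loop()
    print(f"{rate / 1e6:.1f} MFLOPS (interpreted Python, illustration only)")
```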
The following benchmark methods are currently offered. We've indicated the x86/x64 CPU benchmark difference in brackets where there is a different approach in benchmarking.
1) Memory Read: Measures the bandwidth between the GPU device and the CPU, effectively measuring the speed at which the GPU can copy data from its own device memory into the system memory. It is also called Device-to-Host Bandwidth. [[[ The CPU benchmark measures the classic memory read bandwidth, the speed at which the CPU can read data from the system memory. ]]]
2) Memory Write: Measures the bandwidth between the CPU and the GPU device, effectively measuring the speed at which the GPU can copy data from the system memory into its own device memory. It is also called Host-to-Device Bandwidth. [[[ The CPU benchmark measures the classic memory write bandwidth, the speed at which the CPU can write data into the system memory. ]]]
3) Memory Copy: Measures the performance of the GPU's own device memory, effectively measuring the speed at which the GPU can copy data from its own device memory to another place in the same device memory. It is also called Device-to-Device Bandwidth. [[[ The CPU benchmark measures the classic memory copy bandwidth, the speed at which the CPU can move data in the system memory from one place to another. ]]]
4) Single-Precision FLOPS: Measures the classic MAD (Multiply-Add) performance of the GPU, otherwise known as FLOPS (Floating-Point Operations Per Second), with single-precision (32-bit, "float") floating-point data.
5) Double-Precision FLOPS: Measures the classic MAD (Multiply-Add) performance of the GPU, otherwise known as FLOPS (Floating-Point Operations Per Second), with double-precision (64-bit, "double") floating-point data. Not all GPUs support double-precision floating-point operations. For example, all current Intel desktop and mobile graphics devices only support single-precision floating-point operations.
6) 24-bit Integer IOPS: Measures the classic MAD (Multiply-Add) performance of the GPU, otherwise known as IOPS (Integer Operations Per Second), with 24-bit integer ("int24") data. This special data type is defined in OpenCL on the basis that many GPUs are capable of executing int24 operations via their floating-point units, effectively increasing the integer performance by a factor of 3 to 5, as compared to using 32-bit integer operations.
7) 32-bit Integer IOPS: Measures the classic MAD (Multiply-Add) performance of the GPU, otherwise known as IOPS (Integer Operations Per Second), with 32-bit integer ("int") data.
8) 64-bit Integer IOPS: Measures the classic MAD (Multiply-Add) performance of the GPU, otherwise known as IOPS (Integer Operations Per Second), with 64-bit integer ("long") data. Most GPUs do not have dedicated execution resources for 64-bit integer operations, so instead they emulate the 64-bit integer operations via existing 32-bit integer execution units. In such cases, 64-bit integer performance can be very low.
9) Single-Precision Julia: Measures the single-precision (32-bit, "float") floating-point performance through the computation of several frames of the popular "Julia" fractal.
10) Double-Precision Mandel: Measures the double-precision (64-bit, "double") floating-point performance through the computation of several frames of the popular "Mandelbrot" fractal. Not all GPUs support double-precision floating-point operations. For example, all current Intel desktop and mobile graphics devices only support single-precision floating-point operations.
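For those not familiar with the two fractal workloads, the sketch below shows the escape-time iteration that both are commonly based on. It is only an illustration under my own assumptions: the Julia constant, the iteration limit and the function names are hypothetical, and the actual AIDA64 kernels may be organized quite differently. Python floats are double precision, so this mirrors the double-precision variant; a single-precision version would do the same arithmetic with 32-bit floats.

```python
def escape_time(z, c, max_iter=256):
    """Iterate z = z*z + c and return how many steps ran before |z| exceeded 2."""
    for i in range(max_iter):
        if abs(z) > 2.0:
            return i
        z = z * z + c  # a handful of multiplies and adds per step
    return max_iter


def julia_pixel(point, c=complex(-0.8, 0.156), max_iter=256):
    """Julia set: the pixel is the starting z; c is one fixed constant per frame.

    The default c is an arbitrary illustrative value.
    """
    return escape_time(point, c, max_iter)


def mandelbrot_pixel(point, max_iter=256):
    """Mandelbrot set: z starts at zero and the pixel supplies c."""
    return escape_time(0j, point, max_iter)


if __name__ == "__main__":
    # Render a tiny 4x4 grid over the square [-1.5, 1.5] x [-1.5, 1.5].
    for row in range(4):
        y = -1.5 + row
        print([mandelbrot_pixel(complex(-1.5 + col, y), 64) for col in range(4)])
```

Every pixel is independent of its neighbours, which is what makes this kind of workload map well onto many GPU threads.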
As for the GPGPU Benchmark Panel's user interface:
1) You can use the checkboxes to enable or disable utilizing a specific GPU device or the CPU. The state of the CPU checkbox is remembered after closing and re-opening the panel.
2) You can launch the benchmarks for the selected devices by pushing the Start Benchmark button. To run all benchmarks on the GPU(s) only, double-click the GPU column label. To run only the Memory Read benchmark on both the GPU(s) and the CPU, double-click the Memory Read label. To run only the Memory Read benchmark on the GPU(s) only, double-click the cell where that result will appear once the benchmark is completed.
3) The benchmarks are executed simultaneously on all selected GPUs, using multiple threads and multiple OpenCL contexts, each with a single command queue. CPU benchmarks, however, are only launched after the GPU benchmarks are completed. It is currently not possible to run the GPU and CPU benchmarks simultaneously.
4) In case the system has multiple GPUs, the first results column will display an aggregated score for all GPUs. The individual GPU results are combined (added up), and the column label will read e.g. "4 GPUs". If you want to check the individual results, you can either uncheck some of the GPUs until just one GPU is left checked, or push the Results button to open the results window.
5) In case you've got exactly two GPU devices, and you disable the CPU test by unchecking its checkbox, the panel will switch to dual-GPU mode where the first column will be used for GPU1 results, and the second column will be used for GPU2 results. If after obtaining the results you want to check the combined performance of GPU1+GPU2, just check the CPU again, and the interface will switch back to the default layout.
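The parallel execution and the combined score described above can be pictured with the following Python sketch. It is a simplified model under my own assumptions, not the panel's actual code: the device names, the benchmark callable and both function names are hypothetical, and plain threads stand in for the per-GPU OpenCL contexts.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_devices(devices, benchmark):
    """Run benchmark(device) on every device at once, one thread per device.

    devices must be a non-empty list; benchmark is a hypothetical callable
    that returns a numeric score for a single device.
    """
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        scores = list(pool.map(benchmark, devices))
    return dict(zip(devices, scores))


def aggregate(results):
    """Add up the per-device scores and build a column label such as '4 GPUs'."""
    if len(results) > 1:
        label = f"{len(results)} GPUs"
    else:
        label = next(iter(results))  # a single device keeps its own name
    return label, sum(results.values())


if __name__ == "__main__":
    # Made-up scores standing in for real measurements.
    fake_scores = {"GPU1": 100.0, "GPU2": 250.0}
    results = run_on_devices(list(fake_scores), fake_scores.get)
    print(aggregate(results))
```

The individual results stay available in the returned dictionary, similar to how the Results window keeps the per-GPU numbers next to the combined column.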
Q: Is it possible to measure performance of OpenCL CPU devices?
A: No, it's not available currently, because OpenCL CPU drivers are simply not suitable for proper benchmarking. They execute code a lot slower than native x86/x64 machine code or sometimes even regular multi-threaded C++ code.
Q: Do AIDA64 GPGPU benchmarks use vectorized data types and unrolling techniques to boost performance?
A: Yes, both, in order to make the job of OpenCL compilers a bit easier. On top of that, the OpenCL compiler may still apply additional optimizations, like further unrolling. That is completely up to the OpenCL compiler.
Q: Is the OpenCL-capable VIA chipset (VX11) supported?
A: No, because currently there's no stable OpenCL compiler and OpenCL driver for VIA chipsets or processors.
Q: Are OpenCL 2.0 and HSA supported on AMD Kaveri systems?
A: Yes, except for the memory benchmarks. Memory benchmarks currently don't work with HSA, because the current AMD HSA implementation doesn't yet support forcing the usage of device memory, but instead it automatically assumes that allocated memory blocks are to be shared between the CPU and GPU. As soon as AMD's OpenCL 2.0 and HSA implementation gets more mature, these issues will be resolved.
Q: Are the latest-generation dGPUs, like AMD Radeon R9 290/290X, nVIDIA GeForce GTX Titan and GTX 780, fully supported?
A: Yes, but on dGPUs that use clock boosting and/or throttling, it is very important to decide whether you want to measure the absolute maximum attainable performance, or the average performance. If you're looking for the absolute maximum scores, then make sure to start AIDA64 GPGPU Benchmarks when the video card is cool, and with power limits set to a relaxed value (AMD PowerControl). If you're looking for the average performance, then make sure to disable the CPU benchmarks, and execute the GPU benchmark methods at least 10 times back to back, to properly heat the video card up.
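If you log the scores yourself, the difference between the two goals can be summarized as in the sketch below. The benchmark callable and the function names are hypothetical stand-ins; AIDA64 itself is operated through the panel, so treat this only as a model of the bookkeeping.

```python
import statistics


def run_repeated(benchmark, runs=10):
    """Call benchmark() back to back and collect the scores in order."""
    return [benchmark() for _ in range(runs)]


def summarize_runs(scores):
    """Report the best single score and the mean across all runs."""
    return {"peak": max(scores), "average": statistics.mean(scores)}


if __name__ == "__main__":
    # Made-up scores that drop as a card heats up and starts throttling.
    fake = iter([100, 98, 95, 92, 90, 90, 89, 89, 89, 88])
    print(summarize_runs(run_repeated(lambda: next(fake))))
```

On a card that throttles, the peak typically comes from the first, cool run, while the average reflects the sustained, heated-up state.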
Q: Is OpenCL benchmarking under Windows 8.1 and Windows Server 2012 R2 supported?
A: Yes, as long as the video drivers are properly installed.
Q: On the Intel Core i7 "Haswell" processor, the CPU results are all considerably higher than the Intel HD Graphics 4600 "GT2" GPU results. How is that possible?
A: AIDA64 CPU benchmarks are heavily optimized for Haswell and all other modern CPU architectures, and they utilize any available instruction set extensions like SSE, AVX, AVX2, FMA or XOP, and of course full vectorization as well. Using FMA and AVX2, a quad-core Haswell's x86/x64 part can indeed provide very high computing performance, well exceeding the performance of its GT2 iGPU. It is, however, much easier to write such optimized code for the iGPU via OpenCL than for the CPU via a machine code generator or x86/x64 assembly.
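To put the FMA and AVX2 remark into numbers, the theoretical single-precision peak of such a CPU can be estimated with a back-of-the-envelope calculation. The sketch below relies on the publicly documented Haswell layout of two 256-bit FMA units per core (8 single-precision lanes each), with one FMA counted as two operations. The 3.5 GHz clock is only an illustrative assumption, the function name is made up, and measured scores will be lower than this theoretical ceiling.

```python
def peak_flops(cores, clock_hz, simd_lanes, fma_units, ops_per_fma=2):
    """Theoretical peak: every FMA unit retires one full-width FMA per cycle."""
    return cores * clock_hz * simd_lanes * fma_units * ops_per_fma


if __name__ == "__main__":
    # Quad-core Haswell, 256-bit vectors = 8 single-precision lanes,
    # two FMA units per core; the clock value is an assumption.
    peak = peak_flops(cores=4, clock_hz=3_500_000_000, simd_lanes=8, fma_units=2)
    print(f"{peak / 1e9:.0f} GFLOPS single-precision (theoretical)")
```

Halving simd_lanes to 4 gives the corresponding double-precision estimate.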
You can try the new OpenCL GPGPU Benchmarks in the following new beta release of AIDA64 Extreme:
After upgrading to this new version, make sure to restart Windows to finalize the upgrade.
Please let us know here in this topic if you've got any comments or ideas about the new benchmarks.