OpenCL GPGPU benchmarks

Fiery · October 30, 2013

We're rolling out a new major update to AIDA64 in a few weeks. It will feature the usual improvements to support the latest and greatest hardware technologies, such as GPU details for AMD Radeon R5, R7 and R9 Series and nVIDIA GeForce GTX 760 Ti OEM, and optimized benchmarks for AMD Kaveri and Intel Bay Trail.

But most importantly, we're introducing a brand new benchmark panel that offers a set of OpenCL GPGPU benchmarks that you can launch from AIDA64 / main menu / Tools / GPGPU Benchmarks. These benchmarks are designed to measure GPGPU computing performance via different OpenCL workloads. Every benchmark method is designed to work on up to 16 GPUs, including AMD, Intel and nVIDIA GPUs, in any combination. Of course, CrossFire and SLI configurations, and both dGPUs and APUs, are also fully supported. HSA configurations are handled via preliminary support. Basically, any compute-capable device that appears as a GPU device among the OpenCL devices will be benchmarked.

The OpenCL benchmark methods currently offered are not specifically optimized for any GPU architecture. Instead, the AIDA64 OpenCL module relies on the OpenCL compiler to optimize the OpenCL kernel to run best on the underlying hardware. The OpenCL kernels used for these benchmarks are compiled at run time, using the actual OpenCL driver the OpenCL GPU device belongs to. Because of that approach, it is always best to have all video drivers (Catalyst, ForceWare, HD Graphics, etc.) updated to their latest version. For compilation, the following OpenCL compiler options are passed: -cl-fast-relaxed-math -cl-mad-enable.

On top of that, the GPGPU Benchmark Panel also has a CPU column, for comparison purposes. The CPU measurements, however, are not obtained via OpenCL, but using native x86/x64 machine code, utilizing available instruction set extensions like SSE, AVX, AVX2, FMA and XOP. The CPU benchmarks are very similar to the existing AIDA64 CPU and FPU benchmarks, but this time they measure maximum computing rates (FLOPS, IOPS). The CPU benchmarks are heavily multi-threaded, and are optimized for every CPU architecture introduced since the first Pentium came out.

The following benchmark methods are currently offered. We've indicated the x86/x64 CPU benchmark difference in brackets where there is a different approach in benchmarking.

1) Memory Read: Measures the bandwidth between the GPU device and the CPU, effectively measuring the rate at which the GPU can copy data from its own device memory into the system memory. It is also called Device-to-Host Bandwidth. [The CPU benchmark measures the classic memory read bandwidth, the rate at which the CPU can read data from the system memory.]

2) Memory Write: Measures the bandwidth between the CPU and the GPU device, effectively measuring the rate at which the GPU can copy data from the system memory into its own device memory. It is also called Host-to-Device Bandwidth. [The CPU benchmark measures the classic memory write bandwidth, the rate at which the CPU can write data into the system memory.]

3) Memory Copy: Measures the performance of the GPU's own device memory, effectively measuring the rate at which the GPU can copy data from its own device memory to another place in the same device memory. It is also called Device-to-Device Bandwidth. [The CPU benchmark measures the classic memory copy bandwidth, the rate at which the CPU can move data in the system memory from one place to another.]

4) Single-Precision FLOPS: Measures the classic MAD (Multiply-Addition) performance of the GPU, otherwise known as FLOPS (Floating-Point Operations Per Second), with single-precision (32-bit, "float") floating-point data.

5) Double-Precision FLOPS: Measures the classic MAD (Multiply-Addition) performance of the GPU, otherwise known as FLOPS (Floating-Point Operations Per Second), with double-precision (64-bit, "double") floating-point data. Not all GPUs support double-precision floating-point operations. For example, all current Intel desktop and mobile graphics devices only support single-precision floating-point operations.

6) 24-bit Integer IOPS: Measures the classic MAD (Multiply-Addition) performance of the GPU, otherwise known as IOPS (Integer Operations Per Second), with 24-bit integer ("int24") data. This special data type is defined in OpenCL on the basis that many GPUs are capable of executing int24 operations via their floating-point units, effectively increasing the integer performance by a factor of 3 to 5, as compared to using 32-bit integer operations.

7) 32-bit Integer IOPS: Measures the classic MAD (Multiply-Addition) performance of the GPU, otherwise known as IOPS (Integer Operations Per Second), with 32-bit integer ("int") data.

8) 64-bit Integer IOPS: Measures the classic MAD (Multiply-Addition) performance of the GPU, otherwise known as IOPS (Integer Operations Per Second), with 64-bit integer ("long") data. Most GPUs do not have dedicated execution resources for 64-bit integer operations, so instead they emulate the 64-bit integer operations via existing 32-bit integer execution units. In such cases, 64-bit integer performance can be very low.

9) Single-Precision Julia: Measures the single-precision (32-bit, "float") floating-point performance through the computation of several frames of the popular "Julia" fractal.

10) Double-Precision Mandel: Measures the double-precision (64-bit, "double") floating-point performance through the computation of several frames of the popular "Mandelbrot" fractal. Not all GPUs support double-precision floating-point operations. For example, all current Intel desktop and mobile graphics devices only support single-precision floating-point operations.
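To give a rough idea of what the fractal benchmarks (items 9 and 10) compute, here is a minimal CPU-only sketch in Python of rendering Julia frames and reporting frames per second. It is not the AIDA64 OpenCL kernel: the resolution, the constant c, the iteration limit and the function names are all assumptions chosen for illustration.

```python
import time

def julia_frame(width, height, c, max_iter=64):
    """Escape-time iteration counts for one frame of the Julia set of z -> z*z + c."""
    frame = []
    for py in range(height):
        # Map the pixel row onto the imaginary axis in [-1.5, 1.5] (assumed viewport).
        y = -1.5 + 3.0 * py / (height - 1)
        row = []
        for px in range(width):
            x = -1.5 + 3.0 * px / (width - 1)
            z = complex(x, y)
            n = 0
            # Iterate until the point escapes the radius-2 disc or the limit is hit.
            while n < max_iter and abs(z) <= 2.0:
                z = z * z + c
                n += 1
            row.append(n)
        frame.append(row)
    return frame

def frames_per_second(num_frames=3, width=64, height=48):
    """Render a few frames and return the achieved frame rate."""
    start = time.perf_counter()
    for i in range(num_frames):
        # Vary c slightly per frame so every frame differs (an assumption).
        julia_frame(width, height, complex(-0.8 + 0.01 * i, 0.156))
    elapsed = time.perf_counter() - start
    return num_frames / elapsed

if __name__ == "__main__":
    print(f"{frames_per_second():.2f} FPS")
```

A Mandelbrot frame differs only in the setup: z starts at 0 and c is taken from the pixel coordinate.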

------------------------------------------------------------------------

As for the GPGPU Benchmark Panel's user interface:

1) You can use the checkboxes to enable or disable utilizing a specific GPU device or the CPU. The state of the CPU checkbox is remembered after closing and re-opening the panel.

2) You can launch the benchmarks for the selected devices by pushing the Start Benchmark button. In case you want to run all benchmarks, but only on the GPU(s), you can double-click on the GPU column label to do so. In case you only want to run the Memory Read benchmarks on both the GPU(s) and the CPU, you can double-click on the Memory Read label to do so. In case you only want to run the Memory Read benchmark on only the GPU(s), you can double-click on the cell where the requested result should appear after the benchmark is completed.

3) The benchmarks are executed simultaneously on all selected GPUs, using multiple threads and multiple OpenCL contexts, each with a single command queue. CPU benchmarks, however, are only launched after the GPU benchmarks are completed. It is currently not possible to run the GPU and CPU benchmarks simultaneously.

4) In case the system has multiple GPUs, the first results column will display an aggregated score for all GPUs. The individual GPU results are combined (added up), and the column label will read e.g. "4 GPUs". If you want to check the individual results, you can either uncheck some of the GPUs until just one GPU is left checked, or push the Results button to open the results window.

5) In case you've got exactly two GPU devices, and you disable the CPU test by unchecking its checkbox, the panel will switch to dual-GPU mode, where the first column is used for GPU1 results and the second column for GPU2 results. If, after obtaining the results, you want to check the combined performance of GPU1+GPU2, just check the CPU again, and the interface will switch back to the default layout.
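The scheduling and column rules in items 3 to 5 can be sketched roughly as follows. This is only an illustration in Python, not AIDA64 code: the function names, the device names, the fake scores and the single-GPU column label are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def run_on_gpus(benchmark, gpus):
    """Run benchmark(gpu) on all selected GPUs at the same time, one thread per device."""
    if not gpus:
        return {}
    with ThreadPoolExecutor(max_workers=len(gpus)) as pool:
        scores = list(pool.map(benchmark, gpus))
    return dict(zip(gpus, scores))

def result_columns(gpu_scores, cpu_score=None):
    """Build (label, score) columns from per-GPU scores and an optional CPU score."""
    names = list(gpu_scores)
    if len(names) == 2 and cpu_score is None:
        # Dual-GPU mode: exactly two GPUs and the CPU test disabled.
        return [("GPU1", gpu_scores[names[0]]), ("GPU2", gpu_scores[names[1]])]
    # Default layout: individual GPU results are added up into one column.
    label = "GPU" if len(names) == 1 else f"{len(names)} GPUs"  # single-GPU label is assumed
    columns = [(label, sum(gpu_scores.values()))]
    if cpu_score is not None:
        columns.append(("CPU", cpu_score))
    return columns

if __name__ == "__main__":
    fake_scores = {"gpu-a": 100.0, "gpu-b": 250.0}       # made-up numbers
    gpu_results = run_on_gpus(fake_scores.get, list(fake_scores))
    cpu_result = 40.0                                    # CPU runs only after the GPUs finish
    print(result_columns(gpu_results, cpu_result))
    print(result_columns(gpu_results))
```

In a real OpenCL host program each worker thread would own its context and command queue; here the worker is just a lookup so the sketch runs anywhere.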

------------------------------------------------------------------------

FAQ:

Q: Is it possible to measure performance of OpenCL CPU devices?

A: No, it's not available currently, because OpenCL CPU drivers are simply not suitable for proper benchmarking. They execute code a lot slower than native x86/x64 machine code or sometimes even regular multi-threaded C++ code.

Q: Do AIDA64 GPGPU benchmarks use vectorized data types and unrolling techniques to boost performance?

A: Yes, both, in order to make the job of OpenCL compilers a bit easier. On top of that, the OpenCL compiler may still apply additional optimizations, like further unrolling; that is completely up to the OpenCL compiler.

Q: Is the OpenCL-capable VIA chipset (VX11) supported?

A: No, because currently there's no stable OpenCL compiler and OpenCL driver for VIA chipsets or processors.

Q: Are OpenCL 2.0 and HSA supported on AMD Kaveri systems?

A: Yes, except for the memory benchmarks. Memory benchmarks currently don't work with HSA, because the current AMD HSA implementation doesn't yet support forcing the usage of device memory, but instead it automatically assumes that allocated memory blocks are to be shared between the CPU and GPU. As soon as AMD's OpenCL 2.0 and HSA implementation gets more mature, these issues will be resolved.

Q: Are the latest generation dGPUs, like AMD Radeon R9 290/290X, nVIDIA GeForce GTX Titan and GTX 780 fully supported?

A: Yes, but on such dGPUs where clock boosting and/or throttling is used, it is very important to decide whether you want to measure the absolute maximum attainable performance, or the average performance. If you're looking for the absolute maximum scores, then make sure to start AIDA64 GPGPU Benchmarks when the video card is cool, and with power limits set to a relaxed value (AMD PowerControl). If you're looking for the average performance, then make sure to disable the CPU benchmarks, and execute the GPU benchmark methods at least 10 times right after each other, to properly heat the video card up.
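The difference between the two ways of measuring can be illustrated with a small Python sketch. It assumes you have collected one score per run yourself; the function names, the warm-up parameter and the numbers in the example are hypothetical and not part of AIDA64.

```python
import statistics

def repeat_benchmark(run_once, repeats=10):
    """Call run_once() back to back and collect the score of every run."""
    return [run_once() for _ in range(repeats)]

def summarize_runs(scores, warmup=0):
    """Return (peak, sustained): best single run, and mean of the runs after warm-up."""
    if warmup >= len(scores):
        raise ValueError("need at least one run after the warm-up runs")
    steady = scores[warmup:]
    return max(scores), statistics.mean(steady)

if __name__ == "__main__":
    # Made-up GFLOPS scores of a card that throttles as it heats up.
    scores = [4400.0, 4380.0, 4200.0, 4100.0, 4100.0, 4090.0]
    peak, sustained = summarize_runs(scores, warmup=3)
    print(f"peak {peak:.0f} GFLOPS, sustained {sustained:.0f} GFLOPS")
```

Discarding the first few runs is one possible way to approximate the "heated up" state; how many runs to discard depends on the card and its cooling.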

Q: Is OpenCL benchmarking under Windows 8.1 and Windows Server 2012 R2 supported?

A: Yes, as long as the video drivers are properly installed.

Q: On the Intel Core i7 "Haswell" processor, the CPU results are all considerably higher than the Intel HD Graphics 4600 "GT2" GPU results. How is that possible?

A: AIDA64 CPU benchmarks are heavily optimized for Haswell and all other modern CPU architectures, and they utilize any available instruction set extensions like SSE, AVX, AVX2, FMA or XOP, and of course full vectorization as well. Using FMA and AVX2, a quad-core Haswell's x86/x64 part can indeed provide very high computing performance, well exceeding the performance of its GT2 iGPU. It is, however, much easier to write such optimized code for the iGPU via OpenCL than for the CPU via a machine code generator or x86/x64 assembly.
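A back-of-the-envelope calculation shows why this is plausible. The Python sketch below computes theoretical single-precision peaks only. The per-cycle figure for a Haswell core follows from two 256-bit FMA units; the clock speeds, the EU count and the per-EU throughput used for the GT2 iGPU are assumptions for illustration, and measured results will differ.

```python
def peak_gflops(units, flops_per_cycle_per_unit, clock_ghz):
    """Theoretical peak in GFLOPS: units x FLOPs per cycle per unit x clock in GHz."""
    return units * flops_per_cycle_per_unit * clock_ghz

# Haswell core: 2 FMA units x 8 floats per 256-bit vector x 2 FLOPs per FMA.
HASWELL_SP_FLOPS_PER_CYCLE = 2 * 8 * 2

if __name__ == "__main__":
    # Assumed: quad-core CPU running at 3.5 GHz.
    cpu = peak_gflops(4, HASWELL_SP_FLOPS_PER_CYCLE, 3.5)
    # Assumed: GT2 iGPU with 20 EUs, 16 FLOPs per cycle per EU, 1.25 GHz.
    igpu = peak_gflops(20, 16, 1.25)
    print(f"CPU peak:  {cpu:.0f} GFLOPS")
    print(f"iGPU peak: {igpu:.0f} GFLOPS")
```

With these assumed inputs the CPU's theoretical peak comes out somewhat above the iGPU's, which matches the direction of the results discussed in the question.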

------------------------------------------------------------------------

You can try the new OpenCL GPGPU Benchmarks in the following new beta release of AIDA64 Extreme:

http://www.aida64.com/downloads/aida64extremebuild2656b7hl0kzgtszip

After upgrading to this new version, make sure to restart Windows to finalize the upgrade.

Please let us know here in this topic if you've got any comments or ideas about the new benchmarks.

Fiery · October 30, 2013

Thanks, but the link is unavailable atm...

Fixed. Thank you for noticing.

Fiery · November 6, 2013

Here's a new AIDA64 beta update that further improves the new GPGPU Benchmark Panel layout and handling:

http://www.aida64.com/downloads/aida64extremebuild2651ztbq8cv4pfzip

Fiery · November 11, 2013

Here's another new AIDA64 beta update that further improves the new GPGPU Benchmark Panel layout and handling:

http://www.aida64.com/downloads/aida64extremebuild2656b7hl0kzgtszip

MAA · November 13, 2013

Thank you for OpenCL benchmark!

Do you have plans to add results of reference systems (like in the current CPU and Memory benchmarks)?

Fiery · November 13, 2013

Maintaining a list of reference systems could be problematic, because unlike classic x86/x64 CPU and FPU benchmarks, results of OpenCL benchmarks heavily depend on the software environment. OpenCL benchmark results may vary between video driver updates, may change when you go from Windows 8 to Windows 8.1, etc.

MAA · November 18, 2013

I have tested the Intel HD 4600:

x64 CPU: Memory Read/Write ~23000 MB/s

GPU: Memory Read/Write ~8900 MB/s

Why is the GPU memory read/write so slow?

Fiery · November 20, 2013

That is one of the limitations of the current Intel iGPU solution. But they fight it successfully by implementing caches in the iGPU.

TRINITAS91 · December 6, 2013

Hi, I benchmarked my hardware ^^

GPGPU only:

NVIDIA

GeForce 8800 Ultra:
* FP32: 374 GFLOPS
* FP64: Not supported
* INT24: 370 GIOPS
* INT32: 73 GIOPS
* INT64: 17 GIOPS
* Julia FP32: 62 FPS
* Mandel FP64: Not supported

GeForce 9800 GTX:
* FP32: 386 GFLOPS
* FP64: Not supported
* INT24: 384 GIOPS
* INT32: 68 GIOPS
* INT64: 15 GIOPS
* Julia FP32: 59 FPS
* Mandel FP64: Not supported

Gigabyte GeForce GTX 280:
* FP32: 434 GFLOPS
* FP64: 54 GFLOPS
* INT24: 422 GIOPS
* INT32: 77 GIOPS
* INT64: 17 GIOPS
* Julia FP32: 83 FPS
* Mandel FP64: 12 FPS

EVGA GeForce GTX 660 Ti SC:
* FP32: 2200 GFLOPS
* FP64: 105 GFLOPS
* INT24: 420 GIOPS
* INT32: 420 GIOPS
* INT64: 105 GIOPS
* Julia FP32: 418 FPS
* Mandel FP64: 27 FPS

AMD/ATI

Mobility Radeon HD5470:
* FP32: 97 GFLOPS
* FP64: Not supported
* INT24: 24 GIOPS
* INT32: 24 GIOPS
* INT64: 5 GIOPS
* Julia FP32: 17 FPS
* Mandel FP64: Not supported

CPU

Athlon 64 3200+ (1 core, 2000 MHz):
* FP32: 8 GFLOPS
* FP64: 4 GFLOPS
* INT24: 3 GIOPS
* INT32: 3 GIOPS
* INT64: 0.5 GIOPS
* Julia FP32: 3 FPS
* Mandel FP64: 1.7 FPS

i5 430M (2 cores, 4 threads, 2270 MHz):
* FP32: 40 GFLOPS
* FP64: 20 GFLOPS
* INT24: 20 GIOPS
* INT32: 20 GIOPS
* INT64: 10 GIOPS
* Julia FP32: 20 FPS
* Mandel FP64: 9.5 FPS

FX-8350 (8 cores, 8 threads, 4200 MHz):
* FP32: 222 GFLOPS
* FP64: 110 GFLOPS
* INT24: 65 GIOPS
* INT32: 65 GIOPS
* INT64: 8 GIOPS
* Julia FP32: 67 FPS
* Mandel FP64: 34 FPS

And you? ^^

Fiery · December 6, 2013

Nice 1st post, thank you. And you've got quite a fleet of PCs.

MAA · December 17, 2013

Do you have plans to add a test (hash calculation, for example) with other integer operations (shift/rotate, as used in BTC mining)?

http://www.extremetech.com/computing/153467-amd-destroys-nvidia-bitcoin-mining/2

Fiery · December 18, 2013

We do have plans for more OpenCL GPGPU benchmarks, e.g. Hash, AES and ray-tracing. We'll work on them in 2014.

MAA · December 18, 2013

Excellent!

P.S. In my opinion, a GPU has no chance against a CPU with AES-NI ))

TRINITAS91 · July 11, 2014

With the new version of AIDA64 and my new GPU, here is a new benchmark comparison with SHA-1 and AES-256.

I can't test the HD3870 (it is CAL only) or the HD4870 (impossible to use OpenCL). I'll add the HD7970 and HD6990 later, once I receive them this week ^^

Fiery · July 12, 2014

That must have been a hell of a job to compile those results. Thank you for posting them!

Cyk · August 8, 2014

Almost 10 TFLOPS in single precision.

Each card has approx.:

4400 GFLOPS SP

23000 MB/s AES-256

42000 MB/s SHA-1 Hash

@Thanks corrected.

Fiery · August 9, 2014

Thank you for posting your scores. I suppose you meant to write 10 TFLOPS, instead of 1 TFLOPS.

TRINITAS91 · November 11, 2018

Hi all,

I'm back to ask when there will be new precisions for the GPGPU benchmarks.

I'm thinking of INT4, INT8 and INT16, and FP8 and FP16, for GPU and CPU.

Fiery · November 13, 2018

On 11/11/2018 at 8:08 PM, TRINITAS91 said:

Hi all,

I'm back to ask when there will be new precisions for the GPGPU benchmarks.

I'm thinking of INT4, INT8 and INT16, and FP8 and FP16, for GPU and CPU.

Maybe next year

Rayden · November 20, 2018

When I open GPGPU Benchmarks, only the integrated GPU of the i7-7700HQ is displayed.

The system also includes an NVIDIA GeForce GTX 1060, but it doesn't appear.

When I force the system (via Nexoc Control Center) and AIDA64.exe (via NVIDIA system and Win10) to use the discrete GPU, AIDA64 doesn't find a GPU at all.

I am using Windows 10 Pro 1803 and ForceWare 416.34.

Fiery · November 20, 2018

13 hours ago, Rayden said:

When I open GPGPU Benchmarks, only the integrated GPU of the i7-7700HQ is displayed.

The system also includes an NVIDIA GeForce GTX 1060, but it doesn't appear.

When I force the system (via Nexoc Control Center) and AIDA64.exe (via NVIDIA system and Win10) to use the discrete GPU, AIDA64 doesn't find a GPU at all.

I am using Windows 10 Pro 1803 and ForceWare 416.34.

There must be something wrong with the OpenCL software stack. Try to uninstall and reinstall ForceWare.

TRINITAS91 · December 10, 2018

My current config

MAA · January 10

On 10/30/2013 at 7:05 PM, Fiery said:

4) Single-Precision FLOPS: Measures the classic MAD (Multiply-Addition) performance of the GPU, otherwise known as FLOPS (Floating-Point Operations Per Second), with single-precision (32-bit, "float") floating-point data.

5) Double-Precision FLOPS: Measures the classic MAD (Multiply-Addition) performance of the GPU, otherwise known as FLOPS (Floating-Point Operations Per Second), with double-precision (64-bit, "double") floating-point data.

6) 24-bit Integer IOPS: Measures the classic MAD (Multiply-Addition) performance of the GPU, otherwise known as IOPS (Integer Operations Per Second), with 24-bit integer ("int24") data.

7) 32-bit Integer IOPS: Measures the classic MAD (Multiply-Addition) performance of the GPU, otherwise known as IOPS (Integer Operations Per Second), with 32-bit integer ("int") data.

8) 64-bit Integer IOPS: Measures the classic MAD (Multiply-Addition) performance of the GPU, otherwise known as IOPS (Integer Operations Per Second), with 64-bit integer ("long") data.

What exact method do you use to measure FLOPS/IOPS? Matrix multiplication or something else?

Fiery · January 10

3 hours ago, MAA said:

What exact method do you use to measure FLOPS/IOPS? Matrix multiplication or something else?

Simple MAD/FMA instructions.
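As a rough illustration of what measuring with simple MAD instructions means, here is a minimal Python sketch that times a chain of multiply-add operations and counts each one as two floating-point operations. It is not the AIDA64 kernel: the function names, constants and iteration count are assumptions, and in pure Python the interpreter overhead dominates, so the number it prints says nothing about the hardware's real peak.

```python
import time

def mad_chain(iterations, a=1.0000001, b=1e-7):
    """Run a dependent chain of multiply-add operations: x = x * a + b."""
    x = 0.5
    for _ in range(iterations):
        x = x * a + b
    return x

def measure_flops(iterations=200_000):
    """Time the chain and convert it to operations per second (2 per multiply-add)."""
    start = time.perf_counter()
    result = mad_chain(iterations)
    elapsed = time.perf_counter() - start
    flops = 2 * iterations / elapsed
    # Return the result too, so the work cannot be considered unused.
    return flops, result

if __name__ == "__main__":
    flops, _ = measure_flops()
    print(f"{flops / 1e6:.1f} MFLOPS (interpreter-bound)")
```

A GPU kernel would run many such chains in parallel, typically on vector types and with unrolled loops, as the FAQ above mentions.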

MAA · June 6

Can you add a LINPACK benchmark to the GPGPU and CPU FPU benchmarks?
