AIDA64 Discussion Forum


Fiery    298

We're rolling out a new major update to AIDA64 in a few weeks.  It will feature the usual improvements to support the latest and greatest hardware technologies, such as GPU details for AMD Radeon R5, R7 and R9 Series and nVIDIA GeForce GTX 760 Ti OEM, and optimized benchmarks for AMD Kaveri and Intel Bay Trail.

 

But most importantly, we're introducing a brand new benchmark panel that offers a set of OpenCL GPGPU benchmarks that you can launch from AIDA64 / main menu / Tools / GPGPU Benchmarks.  These benchmarks are designed to measure GPGPU computing performance via different OpenCL workloads.  Every benchmark method is designed to work on up to 16 GPUs, including AMD, Intel and nVIDIA GPUs, in any combination.  Of course, CrossFire and SLI configurations, and both dGPUs and APUs, are also fully supported.  HSA configurations are handled via preliminary support.  Basically, any compute-capable device that appears as a GPU device among the OpenCL devices will be benchmarked.

 

The OpenCL benchmark methods currently offered are not specifically optimized for any GPU architecture.  Instead, the AIDA64 OpenCL module relies on the OpenCL compiler to optimize the OpenCL kernel to run best on the underlying hardware.  The OpenCL kernels used for these benchmarks are compiled at run time, using the actual OpenCL driver the OpenCL GPU device belongs to.  Because of that approach, it is always best to have all video drivers (Catalyst, ForceWare, HD Graphics, etc.) updated to their latest version.  For compilation, the following OpenCL compiler options are passed: -cl-fast-relaxed-math -cl-mad-enable.

 

On top of that, the GPGPU Benchmark Panel also has a CPU column, for comparison purposes.  The CPU measurements, however, are not obtained via OpenCL, but using native x86/x64 machine code, utilizing available instruction set extensions like SSE, AVX, AVX2, FMA and XOP.  The CPU benchmarks are very similar to the existing AIDA64 CPU and FPU benchmarks, but this time they measure maximum computing rates (FLOPS, IOPS).  The CPU benchmarks are heavily multi-threaded, and are optimized for every CPU architecture introduced since the first Pentium came out.

 

The following benchmark methods are currently offered.  We've indicated the x86/x64 CPU benchmark difference in brackets where there is a different approach in benchmarking.

 

1) Memory Read: Measures the bandwidth between the GPU device and the CPU, i.e. the rate at which the GPU can copy data from its own device memory into the system memory.  It is also called Device-to-Host Bandwidth.  [[[ The CPU benchmark measures the classic memory read bandwidth, the rate at which the CPU can read data from the system memory. ]]]

 

2) Memory Write: Measures the bandwidth between the CPU and the GPU device, i.e. the rate at which the GPU can copy data from the system memory into its own device memory.  It is also called Host-to-Device Bandwidth.  [[[ The CPU benchmark measures the classic memory write bandwidth, the rate at which the CPU can write data into the system memory. ]]]

 

3) Memory Copy: Measures the performance of the GPU's own device memory, i.e. the rate at which the GPU can copy data from its own device memory to another place in the same device memory.  It is also called Device-to-Device Bandwidth.  [[[ The CPU benchmark measures the classic memory copy bandwidth, the rate at which the CPU can move data in the system memory from one place to another. ]]]

 

4) Single-Precision FLOPS: Measures the classic MAD (Multiply-Addition) performance of the GPU, otherwise known as FLOPS (Floating-Point Operations Per Second), with single-precision (32-bit, "float") floating-point data.

 

5) Double-Precision FLOPS: Measures the classic MAD (Multiply-Addition) performance of the GPU, otherwise known as FLOPS (Floating-Point Operations Per Second), with double-precision (64-bit, "double") floating-point data.  Not all GPUs support double-precision floating-point operations.  For example, all current Intel desktop and mobile graphics devices only support single-precision floating-point operations.

 

6) 24-bit Integer IOPS: Measures the classic MAD (Multiply-Addition) performance of the GPU, otherwise known as IOPS (Integer Operations Per Second), with 24-bit integer ("int24") data.  OpenCL provides dedicated 24-bit integer built-ins (mul24 and mad24) on the basis that many GPUs are capable of executing int24 operations via their floating-point units, effectively increasing the integer performance by a factor of 3 to 5, as compared to using 32-bit integer operations.

 

7) 32-bit Integer IOPS: Measures the classic MAD (Multiply-Addition) performance of the GPU, otherwise known as IOPS (Integer Operations Per Second), with 32-bit integer ("int") data.

 

8) 64-bit Integer IOPS: Measures the classic MAD (Multiply-Addition) performance of the GPU, otherwise known as IOPS (Integer Operations Per Second), with 64-bit integer ("long") data.  Most GPUs do not have dedicated execution resources for 64-bit integer operations, so instead they emulate the 64-bit integer operations via existing 32-bit integer execution units.  In such cases, 64-bit integer performance can be very low.

 

9) Single-Precision Julia: Measures the single-precision (32-bit, "float") floating-point performance through the computation of several frames of the popular "Julia" fractal.

 

10) Double-Precision Mandel: Measures the double-precision (64-bit, "double") floating-point performance through the computation of several frames of the popular "Mandelbrot" fractal.  Not all GPUs support double-precision floating-point operations.  For example, all current Intel desktop and mobile graphics devices only support single-precision floating-point operations.
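
As a rough illustration of how the figures reported by the methods above relate to the work being done, here is a minimal Python sketch.  It is not AIDA64's code: the function names, the convention of counting one MAD as two operations, the 10^6-byte megabyte, and the plotted region of the fractal are all assumptions made for the example.

```python
def bandwidth_mb_per_s(bytes_moved, seconds):
    """Transfer rate in MB/s (assumption: 1 MB = 10**6 bytes)."""
    return bytes_moved / seconds / 1e6


def mad_rate_giga(mad_count, seconds, ops_per_mad=2):
    """GFLOPS or GIOPS for a MAD loop.

    Assumption: each multiply-add is counted as two operations
    (one multiply plus one add), which is the usual convention.
    """
    return mad_count * ops_per_mad / seconds / 1e9


def julia_frame(width, height, c, max_iter=64):
    """Escape-time iteration counts for one frame of z -> z*z + c.

    The region [-1.5, 1.5] x [-1.5, 1.5] is a hypothetical choice;
    width and height must both be at least 2.
    """
    frame = []
    for row in range(height):
        y = -1.5 + 3.0 * row / (height - 1)
        line = []
        for col in range(width):
            x = -1.5 + 3.0 * col / (width - 1)
            z = complex(x, y)  # Julia: the pixel is the starting value
            n = 0
            while n < max_iter and abs(z) <= 2.0:
                z = z * z + c
                n += 1
            line.append(n)
        frame.append(line)
    return frame
```

A fractal benchmark would time several such frames and report frames per second.  For a Mandelbrot frame the same loop would start from z = 0 and take c from the pixel coordinate instead.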

 

------------------------------------------------------------------------

 

As for the GPGPU Benchmark Panel's user interface:

 

1) You can use the checkboxes to enable or disable utilizing a specific GPU device or the CPU.  The state of the CPU checkbox is remembered after closing and re-opening the panel.

 

2) You can launch the benchmarks for the selected devices by pushing the Start Benchmark button.  To run all benchmarks on the GPU(s) only, double-click the GPU column label.  To run only the Memory Read benchmark on both the GPU(s) and the CPU, double-click the Memory Read label.  To run only the Memory Read benchmark on the GPU(s) only, double-click the cell where that result should appear after the benchmark completes.

 

3) The benchmarks are executed simultaneously on all selected GPUs, using multiple threads and multiple OpenCL contexts, each with a single command queue.  CPU benchmarks, however, are only launched after the GPU benchmarks are completed.  It is currently not possible to run the GPU and CPU benchmarks simultaneously.

 

4) In case the system has multiple GPUs, the first results column will display an aggregated score for all GPUs.  The individual GPU results are combined (added up), and the column label will read e.g. "4 GPUs".  If you want to check the individual results, you can either uncheck some of the GPUs until just one GPU is left checked, or push the Results button to open the results window.

 

5) In case you've got exactly two GPU devices, and you disable the CPU test by clearing its checkbox, the panel will switch to dual-GPU mode where the first column will be used for GPU1 results, and the second column will be used for GPU2 results.  If after obtaining the results you want to check the combined performance of GPU1+GPU2, just check the CPU again, and the interface will switch back to the default layout.
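
The scheduling and column behaviour described in points 3 to 5 can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: `run_panel` and `column_labels` are hypothetical names, the benchmarks are stand-in callables that return a score, and the label shown for a single checked GPU is a guess.

```python
from concurrent.futures import ThreadPoolExecutor


def run_panel(gpu_benchmarks, cpu_benchmark=None):
    """Run all GPU benchmarks at the same time, then the CPU benchmark.

    Each benchmark is a hypothetical zero-argument callable returning a score.
    """
    workers = max(1, len(gpu_benchmarks))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(bench) for bench in gpu_benchmarks]
        gpu_scores = [f.result() for f in futures]
    # The CPU run only starts once every GPU run has finished.
    cpu_score = cpu_benchmark() if cpu_benchmark else None
    return {
        "per_gpu": gpu_scores,
        "combined": sum(gpu_scores),  # aggregated score: results added up
        "cpu": cpu_score,
    }


def column_labels(checked_gpus, cpu_checked):
    """Labels of the two result columns for a given checkbox state."""
    if checked_gpus == 2 and not cpu_checked:
        return ["GPU1", "GPU2"]  # dual-GPU mode
    # Assumption: a single checked GPU is simply labelled "GPU".
    first = "GPU" if checked_gpus == 1 else f"{checked_gpus} GPUs"
    return [first, "CPU"]
```

For example, `column_labels(4, True)` gives `["4 GPUs", "CPU"]`, matching the aggregated layout from point 4.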

 

------------------------------------------------------------------------

 

FAQ:

 

Q: Is it possible to measure performance of OpenCL CPU devices?

A: No, it's not available currently, because OpenCL CPU drivers are simply not suitable for proper benchmarking.  They execute code a lot slower than native x86/x64 machine code or sometimes even regular multi-threaded C++ code.

 

Q: Do AIDA64 GPGPU benchmarks use vectorized data types and unrolling techniques to boost performance?

A: Yes, both, in order to make the job of OpenCL compilers a bit easier.  On top of that, the OpenCL compiler may still apply additional optimizations, like further unrolling; that is completely up to the OpenCL compiler.

 

Q: Is the OpenCL-capable VIA chipset (VX11) supported?

A: No, because currently there's no stable OpenCL compiler and OpenCL driver for VIA chipsets or processors.

 

Q: Are OpenCL 2.0 and HSA supported on AMD Kaveri systems?

A: Yes, except for the memory benchmarks.  Memory benchmarks currently don't work with HSA, because the current AMD HSA implementation doesn't yet support forcing the usage of device memory, but instead it automatically assumes that allocated memory blocks are to be shared between the CPU and GPU.  As soon as AMD's OpenCL 2.0 and HSA implementation gets more mature, these issues will be resolved.

 

Q: Are the latest generation dGPUs, like AMD Radeon R9 290/290X, nVIDIA GeForce GTX Titan and GTX 780 fully supported?

A: Yes, but on such dGPUs where clock boosting and/or throttling is used, it is very important to decide whether you want to measure the absolute maximum attainable performance, or the average performance.  If you're looking for the absolute maximum scores, then make sure to start AIDA64 GPGPU Benchmarks when the video card is cool, and with power limits set to a relaxed value (AMD PowerControl).  If you're looking for the average performance, then make sure to disable the CPU benchmarks, and execute the GPU benchmark methods at least 10 times right after each other, to properly heat the video card up.

 

Q: Is OpenCL benchmarking under Windows 8.1 and Windows Server 2012 R2 supported?

A: Yes, as long as the video drivers are properly installed.

 

Q: On the Intel Core i7 "Haswell" processor, the CPU results are all considerably higher than the Intel HD Graphics 4600 "GT2" GPU results.  How is that possible?

A: AIDA64 CPU benchmarks are heavily optimized for Haswell and all other modern CPU architectures, and they utilize any available instruction set extensions like SSE, AVX, AVX2, FMA or XOP, and of course full vectorization as well.  Using FMA and AVX2, a quad-core Haswell's x86/x64 part can indeed provide very high computing performance, well exceeding the performance of its GT2 iGPU.  It is however much easier to write such optimized code for the iGPU via OpenCL than for the CPU via a machine code generator or x86/x64 assembly.
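
Regarding the earlier answer on clock boosting and throttling, the difference between a "cool card" score and a warmed-up score can be made concrete with a small Python sketch.  It is an illustration only: `run_once` is a hypothetical callable standing in for one GPU benchmark pass, and treating the first run as the cold score and the last run as the warmed-up score is an assumption, not something AIDA64 reports.

```python
def cold_and_warm_scores(run_once, runs=10):
    """Run a benchmark back to back and summarise the scores.

    run_once: hypothetical zero-argument callable returning one score.
    runs: at least 10 consecutive runs are suggested to heat the card up.
    """
    scores = [run_once() for _ in range(runs)]
    return {
        "cold": scores[0],      # first run, card still cool
        "warm": scores[-1],     # last run, after the card has heated up
        "mean": sum(scores) / len(scores),
    }
```

If the card throttles, the "cold" value would be expected to sit above the "warm" one; if the two agree, boosting and throttling are probably not affecting the result.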

 

------------------------------------------------------------------------

 

You can try the new OpenCL GPGPU Benchmarks in the following new beta release of AIDA64 Extreme:

 

http://www.aida64.com/downloads/aida64extremebuild2656b7hl0kzgtszip
 

After upgrading to this new version, make sure to restart Windows to finalize the upgrade.

 

Please let us know here in this topic if you've got any comments or ideas about the new benchmarks.

MAA    1

Thank you for the OpenCL benchmark!

Do you plan to add results of reference systems (as in the current CPU and Memory benchmarks)?

Fiery    298

Maintaining a list of reference systems could be problematic, because unlike classic x86/x64 CPU and FPU benchmarks, results of OpenCL benchmarks heavily depend on the software environment.  OpenCL benchmark results may vary between video driver updates, may change when you go from Windows 8 to Windows 8.1, etc.

MAA    1

I have tested the Intel HD4600:

x64 CPU: Memory Read/Write ~23000 MB/s

GPU: Memory Read/Write ~8900 MB/s

Why is the GPU memory read/write so slow?

Fiery    298

One of the limitations of the current Intel iGPU solution ;)  But they fight it successfully by implementing caches in the iGPU.

TRINITAS91    1

Hi, I benchmarked my hardware ^^

GPGPU benchmarks only:

NVIDIA

GeForce 8800 ULTRA:
   * FP32: 374 GFLOPS
   * FP64: Not supported
   * INT24: 370 GIOPS
   * INT32: 73 GIOPS
   * INT64: 17 GIOPS
   * Julia FP32: 62 FPS
   * Mandel FP64: Not supported

GeForce 9800 GTX:
   * FP32: 386 GFLOPS
   * FP64: Not supported
   * INT24: 384 GIOPS
   * INT32: 68 GIOPS
   * INT64: 15 GIOPS
   * Julia FP32: 59 FPS
   * Mandel FP64: Not supported

Gigabyte GeForce GTX 280:
   * FP32: 434 GFLOPS
   * FP64: 54 GFLOPS
   * INT24: 422 GIOPS
   * INT32: 77 GIOPS
   * INT64: 17 GIOPS
   * Julia FP32: 83 FPS
   * Mandel FP64: 12 FPS

EVGA GeForce GTX 660 Ti SC:
   * FP32: 2200 GFLOPS
   * FP64: 105 GFLOPS
   * INT24: 420 GIOPS
   * INT32: 420 GIOPS
   * INT64: 105 GIOPS
   * Julia FP32: 418 FPS
   * Mandel FP64: 27 FPS

AMD/ATI

Mobility Radeon HD5470:
   * FP32: 97 GFLOPS
   * FP64: Not supported
   * INT24: 24 GIOPS
   * INT32: 24 GIOPS
   * INT64: 5 GIOPS
   * Julia FP32: 17 FPS
   * Mandel FP64: Not supported

CPU

Athlon 64 3200+ (1 core, 2000 MHz):
   * FP32: 8 GFLOPS
   * FP64: 4 GFLOPS
   * INT24: 3 GIOPS
   * INT32: 3 GIOPS
   * INT64: 0.5 GIOPS
   * Julia FP32: 3 FPS
   * Mandel FP64: 1.7 FPS

i5 430M (2 cores, 4 threads, 2270 MHz):
   * FP32: 40 GFLOPS
   * FP64: 20 GFLOPS
   * INT24: 20 GIOPS
   * INT32: 20 GIOPS
   * INT64: 10 GIOPS
   * Julia FP32: 20 FPS
   * Mandel FP64: 9.5 FPS

FX-8350 (8 cores, 8 threads, 4200 MHz):
   * FP32: 222 GFLOPS
   * FP64: 110 GFLOPS
   * INT24: 65 GIOPS
   * INT32: 65 GIOPS
   * INT64: 8 GIOPS
   * Julia FP32: 67 FPS
   * Mandel FP64: 34 FPS

And you? ^^

Fiery    298

We do have plans for more OpenCL GPGPU benchmarks, e.g. Hash, AES and ray-tracing.  We'll work on them in 2014 ;)

TRINITAS91    1

With the new version of AIDA64 and my new GPU, here is a new benchmark comparison with SHA1 and AES-256.

I cannot test the HD3870 (it supports only CAL) or the HD4870 (OpenCL cannot be used on it). I'll add the HD7970 and HD6990 once I receive them later this week ^^.

 

sha110.jpg

 

aes-2510.jpg

Fiery    298

That must have been a hell of a job to compile those results. Thank you for posting them!

Cyk    2

Almost 10 TFLOPS in single precision.

34rxc9u.jpg

Each card has approx.:

4400 GFlops SP

23000 MB/s AES-256

42000 MB/s SHA-1 Hash

Thanks, corrected.

Fiery    298

Thank you for posting your scores. I suppose you meant to write 10 TFLOPS, instead of 1 TFLOPS ;)

