AIDA64 Discussion Forum
Fiery

New cache and memory benchmarks in AIDA64 v3.00

We're rolling out a new major update to AIDA64 on June 03, 2013.  It will feature the usual improvements to support the latest and greatest hardware technologies, such as GPU details for AMD Radeon HD 7990 "Malta" and NVIDIA GeForce 700 Series, and optimized benchmarks for AMD Kabini/Temash and Intel Haswell.  On Haswell, AIDA64 v3.00 will utilize all the new instruction set extensions, so the benchmarks and the System Stability Test will also use the appropriate AVX2, FMA and BMI2 instructions.

 

But more importantly, we've replaced the outdated set of memory benchmarks with brand new ones.  The new bandwidth (read, write, copy) benchmarks now use multiple threads to squeeze out every last bit of performance from the caches and the memory modules.  On modern multi-core processors, the old single-threaded benchmarks couldn't show the actual memory bandwidth, only the memory bandwidth available to single-threaded applications.  With the new benchmarks you will get considerably higher scores, much closer to the theoretical memory bandwidth available.  This is especially true for 3-channel and 4-channel memory configurations, such as Intel X58 and X79 based high-end desktop systems, and also for NUMA-enabled multi-socket systems, such as 2- and 4-way AMD Opteron and Intel Xeon based servers and workstations.  For example:

 

Core i7-3960X with X79 chipset and 4-channel DDR3-1600:

- AIDA64 v2.85 Memory Read: 16825 MB/s  [ old ]

- AIDA64 v3.00 Memory Read: 45640 MB/s  [ new ]
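To put these two scores next to the theoretical limit, the peak can be worked out in a few lines of C. This is only a back-of-the-envelope sketch, not AIDA64 code: the function names are made up for illustration, and it assumes the standard 64-bit (8-byte) width of a DDR3 channel.

```c
#include <assert.h>

/* Theoretical peak bandwidth in MB/s for one socket.
 * channels: populated memory channels
 * mega_transfers_per_sec: data rate in MT/s (1600 for DDR3-1600)
 * Assumption: every channel is 64 bits (8 bytes) wide, as in DDR3. */
double ddr_peak_mbps(int channels, double mega_transfers_per_sec)
{
    assert(channels > 0 && mega_transfers_per_sec > 0.0);
    return channels * 8.0 * mega_transfers_per_sec;
}

/* How many times faster the new score is than the old one. */
double speedup(double new_score, double old_score)
{
    assert(old_score > 0.0);
    return new_score / old_score;
}
```

For the 4-channel DDR3-1600 system above, `ddr_peak_mbps(4, 1600.0)` gives 51200 MB/s, so the v3.00 read score is roughly 89% of the peak, the v2.85 score roughly 33%, and `speedup(45640.0, 16825.0)` is about 2.7.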

 

We've also implemented multi-threaded cache benchmarks, which now include support for the latest AVX and AVX2 instruction set extensions.  Because the cache benchmarks now use all CPU cores, you will get dramatically different cache bandwidth scores than with the old benchmarks.  For example:

 

Core i7-3960X with X79 chipset and 4-channel DDR3-1600:

- AIDA64 v2.85 L1 Cache Read: 121.8 GB/s  [ old ]

- AIDA64 v3.00 L1 Cache Read: 674.7 GB/s  [ new ]

 

And thanks to the doubled L1 cache bandwidth of Intel Haswell, the new cache benchmarks of AIDA64 v3.00 will produce unusually high scores on these new processors.  On Haswell, with a hint of overclocking, it's quite easy to cross the 1 TB/s mark for the L1 cache ;)

 

We've also replaced the old cache and memory latency benchmark with a brand new one that uses a different approach, recommended by processor architecture engineers.  The old memory latency benchmark used the classic forward-linear solution, so it "walked" the memory continuously, in the forward direction.  Unfortunately that classic approach was sometimes over-optimized by "too smart" memory controllers, which led to unrealistically low latency scores.  It was a constant fight for us to get around those over-optimizations, to make sure AIDA64 provides stable and reliable latency results.  With the new latency benchmark we've switched to a block-random solution, which keeps "jumping" to random addresses inside a memory block for a period of time, and then skips to a new block and continues "jumping" to random places inside that one as well.  With this new solution memory controllers can no longer find a pattern in the latency measurement, and so they cannot over-optimize the benchmark.  The block-random approach however means that latency results will be higher, and since the scores are in nanoseconds, where lower is better, the results will look worse than what you are used to.  For example:

 

Core i7-3960X with X79 chipset and 4-channel DDR3-1600:

- AIDA64 v2.85 Memory Latency: 55.9 ns  [ old ]

- AIDA64 v3.00 Memory Latency: 67.5 ns  [ new ]
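The block-random idea can be sketched as a pointer chase in C. This is not the AIDA64 implementation, only a minimal illustration under stated assumptions: a 64-byte cache line, one pointer per line, a small xorshift generator for the random order, and the coarse standard `clock()` timer. All names (`node`, `link_block_randomly`, `chase`, `block_random_latency_ns`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* One node per cache line; the 64-byte line size is an assumption. */
typedef struct node {
    struct node *next;
    char pad[64 - sizeof(struct node *)];
} node;

/* Tiny xorshift32 generator, deterministic so runs are repeatable. */
static uint32_t rng_state = 2463534242u;
static uint32_t next_random(void)
{
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 17;
    rng_state ^= rng_state << 5;
    return rng_state;
}

/* Link the nodes of one block into a single random cycle (Sattolo's
 * algorithm), so a chase touches every node of the block before it
 * repeats. Returns 0 on success, -1 if the scratch array cannot be
 * allocated. */
int link_block_randomly(node *block, size_t count)
{
    assert(block != NULL && count >= 2);
    size_t *order = malloc(count * sizeof *order);
    if (order == NULL)
        return -1;
    for (size_t i = 0; i < count; i++)
        order[i] = i;
    for (size_t i = count - 1; i > 0; i--) {
        size_t j = next_random() % i;   /* j < i keeps it one cycle */
        size_t tmp = order[i];
        order[i] = order[j];
        order[j] = tmp;
    }
    for (size_t i = 0; i < count; i++)
        block[i].next = &block[order[i]];
    free(order);
    return 0;
}

/* Follow `hops` dependent loads: each address comes from the previous
 * load, so the loads cannot overlap. */
node *chase(node *start, size_t hops)
{
    node *p = start;
    while (hops--)
        p = p->next;
    return p;
}

/* Jump around inside one block, then move on to the next block.
 * Returns the average time per hop in nanoseconds. */
double block_random_latency_ns(node *memory, size_t blocks,
                               size_t nodes_per_block, size_t hops_per_block)
{
    node *volatile sink = NULL;         /* keeps the chase from being optimized out */
    clock_t begin = clock();
    for (size_t b = 0; b < blocks; b++)
        sink = chase(&memory[b * nodes_per_block], hops_per_block);
    clock_t end = clock();
    (void)sink;
    double seconds = (double)(end - begin) / CLOCKS_PER_SEC;
    return seconds * 1e9 / ((double)blocks * (double)hops_per_block);
}
```

To measure memory rather than cache latency, the total buffer would have to be much larger than the last-level cache, and each block would be linked with `link_block_randomly` before timing. A forward-linear walk would correspond to `block[i].next = &block[i + 1]`, which is exactly the kind of pattern a prefetcher can recognize.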

 

AIDA64 v3.00 also supports benchmarking the eDRAM L4 cache of the upcoming Intel Crystal Well processors.  An interesting article about Crystal Well:

 

http://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested

 

A less radical change that still deserves a note: we've also revamped the framework around the CPU ZLib benchmark, so you can see a few percent gain in performance using AIDA64 v3.00.  For example:

 

Core i7-3960X with X79 chipset and 4-channel DDR3-1600:

- AIDA64 v2.85 CPU ZLib: 418.3 MB/s  [ old ]

- AIDA64 v3.00 CPU ZLib: 444.5 MB/s  [ new ]

 

We hope you'll find the new AIDA64 release useful.  Feel free to ask any technical or not-so-technical questions about AIDA64 v3.00 benchmarks in this topic.


 

Regards,

Fiery

 


On our Z77-based Ivy Bridge Core i7-3770K system, with Dual Channel DDR3-1600 RAM we got the following scores using AIDA64 v3.00:

 

Memory Read / Write / Copy / Latency: 23559 MB/s / 24093 MB/s / 22772 MB/s / 57.5 ns

L1 Cache Read / Write / Copy / Latency: 461 GB/s / 231 GB/s / 460 GB/s / 1.1 ns

L2 Cache Read / Write / Copy / Latency: 244 GB/s / 149 GB/s / 216 GB/s / 3.3 ns


How do you measure the speed of the memory on dual-socket systems?

Should the memory speed on a system with two 3-channel Xeons be equal to that of 6 channels?


The new memory bandwidth benchmarks are NUMA-aware, so on a 2-socket system where each socket has its own 3-channel memory interface, the performance of the 6 memory channels will be aggregated in the memory benchmarks.


OK, I do not understand why, on your reference system with two Xeon X5550 (triple-channel DDR3-1333), you get only 37831 MB/s read (out of a theoretical 2*3*8*1333.3 = 64000 MB/s)?


With server systems it's much more difficult to get close to the theoretical maximum of memory bandwidth than with 1-socket systems.  You also need a high CPU core clock speed and 6+ cores to drive the memory closer to its limit. The pair of X5550 processors we have in the Supermicro X8DTN+ motherboard have a relatively low clock speed of 2.66 GHz, and only 4 cores per socket.
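To put a number on that gap, the formula from the question can be written out in C. This is just the arithmetic from this thread (sockets x channels x 8 bytes x MT/s), with hypothetical function names; it says nothing about how AIDA64 itself measures bandwidth.

```c
#include <assert.h>

/* Aggregate theoretical peak in MB/s across all sockets of a NUMA
 * system, assuming 8-byte-wide channels as in the question above. */
double aggregate_peak_mbps(int sockets, int channels_per_socket,
                           double mega_transfers_per_sec)
{
    assert(sockets > 0 && channels_per_socket > 0);
    assert(mega_transfers_per_sec > 0.0);
    return sockets * channels_per_socket * 8.0 * mega_transfers_per_sec;
}

/* Measured score as a percentage of the theoretical peak. */
double percent_of_peak(double measured_mbps, double peak_mbps)
{
    assert(peak_mbps > 0.0);
    return 100.0 * measured_mbps / peak_mbps;
}
```

With two sockets, three channels each and DDR3-1333 (about 1333.3 MT/s), the peak comes out at about 64000 MB/s, and the measured 37831 MB/s is roughly 59% of it.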

Hi Fiery, I am glad that I have at last found the right forum to ask about what interests me as an amateur C coder.

 

I have been using Everest, which is an excellent tool. Do you intend to provide single-threaded L1 cache read speeds along with your new multi-threaded ones? Recently I asked on an OC forum what Haswell's L1 speed is and was struck by the answer of 900 GB/s. Does that mean 900/8 = ~110 GB/s is the Everest-like result?!

 

And if it is not against your policy, can you say how (i.e. which optimizations are in use) Everest reports a 5400 MB/s main RAM read on my Core 2 T7500, while my BURST_Read_8DWORDSi (64MB block) manages only 4.699 MB per clock, i.e. roughly 4,700 MB/s? Your code is much faster!

 

// 'BURST_Read_8DWORDSi' Main Loop:

.B3.3:
;;;     for(; Loop_Counter; Loop_Counter--, p += 4*sizeof(uint32_t)) {
;;;         hash32 = *(uint32_t *)(p+0) ^ *(uint32_t *)(p+0+Second_Line_Offset);
  02ebc 8b 07            mov eax, DWORD PTR [edi]               
;;;         hash32B = *(uint32_t *)(p+4) ^ *(uint32_t *)(p+4+Second_Line_Offset);
  02ebe 8b 77 04         mov esi, DWORD PTR [4+edi]             
;;;         hash32C = *(uint32_t *)(p+8) ^ *(uint32_t *)(p+8+Second_Line_Offset);
  02ec1 8b 57 08         mov edx, DWORD PTR [8+edi]             
;;;         hash32D = *(uint32_t *)(p+12) ^ *(uint32_t *)(p+12+Second_Line_Offset);
  02ec4 8b 4f 0c         mov ecx, DWORD PTR [12+edi]            
  02ec7 33 04 1f         xor eax, DWORD PTR [edi+ebx]           
  02eca 33 74 1f 04      xor esi, DWORD PTR [4+edi+ebx]         
  02ece 33 54 1f 08      xor edx, DWORD PTR [8+edi+ebx]         
  02ed2 33 4c 1f 0c      xor ecx, DWORD PTR [12+edi+ebx]        
  02ed6 83 c7 10         add edi, 16                            
  02ed9 4d               dec ebp                                
  02eda 75 e0            jne .B3.3 

Also, can you share how the interleaved way of reading main memory and L1 (i.e. halving the memory pool and reading both halves in parallel) behaves on Haswell? I hate the fact that I cannot play with new toys myself.

Memory pool starting address: 00DF0040 ... 64 byte aligned, OK

Info1: One second seems to have 998 clocks.
Info2: This CPU seems to be working at 2,191 MHz.

Fetching/Hashing a 64MB block 1024 times i.e. 64GB ...
BURST_Read_4DWORDS:         (64MB block); 65536MB fetched in 15132 clocks or 4.331MB per clock
BURST_Read_8DWORDSi:        (64MB block); 65536MB fetched in 13946 clocks or 4.699MB per clock
FNV1A_YoshimitsuTRIADiiXMM: (64MB block); 65536MB hashed in 13572 clocks or  4.829MB per clock !!! FLASHY-SLASHY: OUTSPEEDS THE INTERLEAVED 8x4 READ !!!
FNV1A_YoshimitsuTRIADii:    (64MB block); 65536MB hashed in 14399 clocks or  4.551MB per clock !!! INTERLEAVED !!!
FNV1A_YoshimitsuTRIAD:      (64MB block); 65536MB hashed in 15912 clocks or  4.119MB per clock !!! NON-INTERLEAVED !!!
CRC32_SlicingBy8:           (64MB block); 65536MB hashed in 71588 clocks or  0.915MB per clock
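Since the figures above are reported per clock tick rather than per second, a small C helper makes the conversion explicit. It is only a sketch of the arithmetic, using the tick rate printed in "Info1" (998 ticks per second); the function name is made up for illustration.

```c
#include <assert.h>

/* Convert "megabytes moved in N clock ticks" into MB/s, given how many
 * ticks the timer counts per second. */
double mb_per_second(double megabytes, double ticks, double ticks_per_second)
{
    assert(ticks > 0.0 && ticks_per_second > 0.0);
    return megabytes / ticks * ticks_per_second;
}
```

For the BURST_Read_8DWORDSi line, `mb_per_second(65536.0, 13946.0, 998.0)` comes out at about 4690 MB/s, which is the number to hold against Everest's 5400 MB/s.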


1) Even if you divide the L1 cache bandwidth scores by the number of cores, it's not really possible to compare Haswell AIDA64 v3.00 L1 cache results against results obtained by previous releases, mostly because only AIDA64 v3.00 utilizes AVX instructions.  Without AVX it's not possible to properly measure Haswell L1 cache bandwidth.

 

2) 5400 vs. 4699 MB/s: you need to use XMM or YMM registers to properly utilize the cache and memory subsystem capabilities of modern Intel processors.

 

3) According to Intel's optimization guide, the bank conflict issue of Sandy Bridge and Ivy Bridge has been eliminated in Haswell.

 

4) "I hate the fact that I cannot play with new toys myself" -- In order to develop kick-ass benchmark methods, you simply need to get the hardware you optimize for :(  Without being able to try every combination on an actual CPU, it's very tough to come up with extreme optimizations.

 

Hope this helps to let you dig deeper into Haswell ;)
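To illustrate point 2, here is a minimal sketch of the interleaved read from the question, rewritten to load 16 bytes at a time into XMM registers through SSE2 intrinsics. It is not AIDA64's code and makes no performance claim; the function name is hypothetical, unaligned loads are used so that any buffer works, and a plain 32-bit fallback is included for compilers that do not target SSE2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* XOR-fold a buffer while reading its two halves in parallel.
 * `bytes` must be a multiple of 32 so that each half splits into whole
 * 16-byte chunks. The folded value is returned so the compiler cannot
 * drop the loads. */
uint32_t xor_read_interleaved(const unsigned char *buf, size_t bytes)
{
    assert(buf != NULL && bytes % 32 == 0);
#if defined(__SSE2__)
    size_t half = bytes / 2;            /* offset of the second half */
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < half; i += 16) {
        /* One 128-bit load from each half per iteration. */
        __m128i lo = _mm_loadu_si128((const __m128i *)(buf + i));
        __m128i hi = _mm_loadu_si128((const __m128i *)(buf + i + half));
        acc = _mm_xor_si128(acc, _mm_xor_si128(lo, hi));
    }
    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    return lanes[0] ^ lanes[1] ^ lanes[2] ^ lanes[3];
#else
    /* Fallback: the same fold with 32-bit loads. */
    uint32_t acc = 0;
    for (size_t i = 0; i < bytes; i += 4) {
        uint32_t word;
        memcpy(&word, buf + i, sizeof word);
        acc ^= word;
    }
    return acc;
#endif
}
```

With a 16-byte aligned buffer, `_mm_load_si128` could replace the unaligned loads, and a YMM version would follow the same shape with the 256-bit AVX intrinsics.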


... try every combination on an actual CPU ...

 

Thanks,

that's exactly what I want to know: the L1 cache load speed, i.e. reads via:

- general purpose 32bit registers;

- general purpose 64bit registers;

- XMM registers;

- YMM registers.

 

My suggestion, as an Everest fan, is to "retain" the single-threaded report along with the new 'modern' one.

 

I still cannot compare e.g. i7 4770K stock with my Core 2, allow me one last question:

What benchmarker do you use to obtain Everest-like L1 cache read speeds?

 

http://www.youtube.com/watch?v=kfLUBOW-yRc

As with the 432 km/h world record shown above, it is useful to know the 'BANDWIDTH' of the car, but the need for speed comes in all kinds!

As you can see at the end, where the speed drops from 400 to 200, the miserable 200 km/h looks and feels casual, but this is the most needed range. I mean that YMM registers are overkill for certain tasks (e.g. superfast hashing); they impose some limitations (latency, AFAIK), i.e. there is a price to be paid. In the Bugatti example: acceleration time, fuel consumption, vibrations, ...

Regards.


My suggestion, as an Everest fan, is to "retain" the single-threaded report along with the new 'modern' one.

I still cannot compare e.g. i7 4770K stock with my Core 2

 

 

Single-threaded memory benchmarking is obsolete, and retaining the old benchmarks next to the multi-threaded ones would just cause more confusion.

 

You can still compare i7-4770K against Core 2 using the new multi-threaded benchmarks.

 

What benchmarker do you use to obtain Everest-like L1 cache read speeds?

 

You can still use the old single-threaded AIDA64 cache & memory benchmarks if you keep a copy of AIDA64 v2.85.

 


Thanks, the problem is my misery, I relied on forums to see how new and old CPUs fare.
There is an old Bulgarian proverb:
"The wolf's neck is thick because he does his work himself."

Good luck


Yo guys,

do ya have a quick hint how I can improve the read, write and copy performance? What in particular should I do?

You've already got some decent scores there. Usually, the way to improve the scores is to use more aggressive memory timings, and/or increase the memory clock, and/or increase the CPU core clock, and/or increase the uncore (integrated memory controller block of the CPU) clock.


Well, I thought increasing the memory timings should improve the scores. The story is that 14-15-15-35 made the scores about 5K worse.

How can I improve the latency of the L3 cache?

When you alter a single memory timing setting, the memory controller may still adjust some other timings automatically, to keep your system stable.

L3 cache latency can be improved by increasing the L3 cache clock (uncore clock).


My uncore clock is already at 4500 MHz. In the benchmark in post #15 the L3 latency is 14.7 ns, and it seems to be getting worse every day. Today the latency is 23.7 ns. Why is it getting worse with the same BIOS settings? :(


I used to have CacheMem results like shown in post 15.

However on my

Rampage IV Extreme Black Edition

4930K

4x4 GB Corsair Dominator Platinum 2133 C9

The read bandwidth is strangely very low.

Tried other sticks: 4x4 GB G.Skill TridentX 2400 C10 .. same result.

Tried different versions of AIDA64, and saw the same thing on my friend's system.

Any clues?


It's not easy to diagnose such issues without any specific numbers ;) Please let me know the clock configuration for both your CPU and IMC (memory), and also the Memory Read, Memory Write and Memory Copy scores you've got via AIDA64.
