i7-5775C L4 cache performance

Sanmayce · June 11, 2015

Hi guys,
can anyone clarify the huge latency (as I see it) of L4 cache obtained on that system:

My dummy expectations were for 20ns, why 42ns?
In my view L4 performs poorly - only 20ns better than RAM.

Sanmayce · June 11, 2015

Those 128MB are interesting, my goal is to see how well L4 helps compression and mostly decompression when using big blocks.

These guys showed that it helps, but they didn't say what the dictionary size was and what method they used.
With L4 now being introduced, I wonder whether AIDA64 is ready to show a similar scenario. Zlib uses only a 32KB window, so is there any plan to add a heavier compressor to the benchmark roster?

Sanmayce · June 11, 2015

Allow me to suggest one additional benchmark showing one of the most needed tasks in single-thread - sorting.
Whether a sorting algorithm or an actual compressor using BWT I see no big difference.

I found this thread giving some flavor of what we can expect:

http://techreport.com/forums/viewtopic.php?f=2&t=93911#p1206147

Code Name             Product                  Cores  Threads  Clocks   Power  Release Date  Notes
Crystal Lake?         Core i7 Extreme Edition  4      16       3.4 GHz  95W    2016Q2        New socket. SoC. Dual channel DDR4 and 24 PCIe 4.0 lanes, 256 MB of eDRAM, unlocked
Crystal Lake refresh  Core i7 Extreme Edition  4      16       3.5 GHz  95W    2016Q4        New socket. SoC. Dual channel DDR4 and 24 PCIe 4.0 lanes, 256 MB of eDRAM, unlocked
Cannon Lake           Core i7 Extreme Edition  6      24       3.6 GHz  87W    2017Q3        New socket. SoC. Dual channel DDR4 and 24 PCIe 4.0 lanes, 256 MB of eDRAM, unlocked

Now, we have 128MB, next year hopefully 256MB L4, how are we gonna estimate the role of L4?

Fiery · June 12, 2015

> Hi guys,
> can anyone clarify the huge latency (as I see it) of L4 cache obtained on that system:
>
> My dummy expectations were for 20ns, why 42ns?
> In my view L4 performs poorly - only 20ns better than RAM.

That is normal, and it's much better than the first generation eDRAM solution. Core i7-4770R: L4 cache latency = 76.2 ns.
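
For a rough feel of what those numbers mean, here is a minimal Python sketch of the average latency seen by accesses that miss L3, assuming a simple two-level model. The 42 ns L4 figure is from this thread, and the 62 ns RAM figure is only inferred from the "20ns better than RAM" remark above. The L4 hit fraction and the function name are hypothetical, and the model ignores any extra cost an L4 miss may add on the way to RAM.

```python
def avg_miss_latency_ns(l4_hit_fraction, l4_ns=42.0, ram_ns=62.0):
    """Average latency of an access that already missed L3.

    l4_hit_fraction is a hypothetical share of those accesses served by L4;
    the rest are assumed to cost plain RAM latency (a simplification).
    """
    if not 0.0 <= l4_hit_fraction <= 1.0:
        raise ValueError("l4_hit_fraction must be between 0 and 1")
    return l4_hit_fraction * l4_ns + (1.0 - l4_hit_fraction) * ram_ns


if __name__ == "__main__":
    # Sweep a few assumed hit fractions to see how the average moves.
    for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"L4 hit fraction {fraction:.2f}: "
              f"{avg_miss_latency_ns(fraction):.1f} ns")
```

Under these assumptions the best case is a 20 ns saving per L3 miss, and any working set that spills past L4 moves the average back towards RAM latency.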

Our ZLib data compression benchmark uses the "best" compression method, so it's more CPU bound than memory/cache interface bound. It is not designed to show off the underlying cache architecture performance.

Our PhotoWorxx image manipulation benchmark should better reflect the performance gains you can get from the L4 cache, but you may need to limit the number of used CPU cores to 3 or 4 if you really want to see a big difference between Haswell and Broadwell-H.

> Allow me to suggest one additional benchmark showing one of the most needed tasks in single-thread - sorting.
> Whether a sorting algorithm or an actual compressor using BWT I see no big difference.

I'm afraid we have no plans to implement a sorting benchmark.

> Now, we have 128MB, next year hopefully 256MB L4, how are we gonna estimate the role of L4?

AIDA64 benchmarks are not designed to show off the performance gain coming from a particular hardware feature, so we will not develop a new benchmark method just to show off how great a 128MB or 256MB L4 is. In fact, in general (non-gaming) use the L4 cache doesn't provide huge gains. BTW, some rumours state that with Skylake and Cannonlake the eDRAM will no longer act as an L4 cache, but more as a dedicated buffer (cache) for the iGPU. I'm not sure if that's true; it would be surprising if Intel made such a move.

Sanmayce · June 12, 2015

> ... It is not designed to show off the underlying cache architecture performance.

That's what I had as a suggestion.

Simply, I wanted to see a single number reported by AIDA giving some overall impression of how the tested machine behaves in an intensive REALWORLD all-caches-involved integer scenario.

Fiery · June 12, 2015

> That's what I had as a suggestion.
>
> Simply, I wanted to see a single number reported by AIDA giving some overall impression of how the tested machine behaves in an intensive REALWORLD all-caches-involved integer scenario.

For that purpose the AIDA64 PhotoWorxx benchmark may be the solution we can offer at this time. It is heavily multi-threaded, it uses integer code, and it utilizes all cache levels.

Sanmayce · June 12, 2015

I see, it really serves well; in a way it is even better than sorting. I myself wrote a picture rotator/viewer called 'Otane' in 16-bit assembly. It stresses the RAM block needed to house the matrix very heavily.
By the way, is your picture big enough to stress L4? This month, with the help of some fellows who own a 5775C, I want to see the impact of L4, which interests me a lot compression-wise. I intend to stress a 256MB block with BWT/LZ decompression. I will ask them to run PhotoWorxx and will compare the results with the decompression ones.

Fiery · June 12, 2015

Yes, the picture size used by the PhotoWorxx benchmark means that up to 384MB of memory is utilized on an 8-thread CPU, so it's good enough to stress the L4 cache, even when its size is increased to 256MB in the future.

Sanmayce · June 12, 2015

Thanks, just one more question. According to my observation, I see a 2x speed increase coming only from the doubled RAM bandwidth. Is that so? I didn't expect such a huge impact from bandwidth alone; I thought the main bottleneck would be latency. Very strange (I see 12 GigaPixels/s, which may be 12GB/s or 36GB/s). Could you clarify why that happens? Is PhotoWorxx bandwidth bound?
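
The 12GB/s versus 36GB/s guess above comes down to how many bytes are counted per pixel. A small Python sketch of that conversion, assuming 1 or 3 bytes per pixel (both values are guesses, not documented PhotoWorxx internals, and the function name is made up for illustration):

```python
def pixel_rate_to_gbs(mpixels_per_s, bytes_per_pixel):
    """Convert a MPixel/s score into GB/s (10^9 bytes/s) of pixel data.

    bytes_per_pixel is an assumption; the real memory traffic per pixel
    may differ, since each pixel can be read and written more than once.
    """
    return mpixels_per_s * 1e6 * bytes_per_pixel / 1e9


if __name__ == "__main__":
    score = 12000  # roughly the 12 GigaPixels/s mentioned above
    for bytes_per_pixel in (1, 3):
        print(f"{bytes_per_pixel} byte(s)/pixel: "
              f"{pixel_rate_to_gbs(score, bytes_per_pixel):.1f} GB/s")
```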

Fiery · June 13, 2015

It depends on the actual bandwidth and latency performance of the CPU, but generally speaking, PhotoWorxx is more bandwidth bound than latency bound. And about your scores: the 2x increase is not really 2x, and the bandwidth difference isn't actually 2x either. The reference FX-8350 system has got 2.3x more bandwidth than your tested system, and it obtained 1.84x of your score.

Sanmayce · June 25, 2015

Hi again,

can you post a snapshot of the highest PhotoWorxx scores you know of?

I searched and found this site:

http://amdfx.blogspot.com/2012/01/aida64-benchmarks-windows-7-fx-patch.html

I found 94,910:

Also, please explain what causes such a suspiciously low score on the 5775C (here):

For AMD 'Zambezi' to be 3x faster, something here seems not to be as it should.

Sanmayce · June 25, 2015

Ouch, I just saw that the screenshots above show, e.g., 20352 for a 4x Phenom II in one AIDA64 EE screenshot, while the other AIDA64 EE screenshot shows 5634 MPixel/s for the same computer. I didn't know that two different metrics were in use. How so?

Balala · June 26, 2015

With AIDA64 v2.70 we revamped the Photoworxx

Fiery · June 26, 2015

One of our server test systems can push 62186 MPixel/s in the current PhotoWorxx benchmark. The system is based on two Xeon E5-2660 v3 "Haswell-EP" processors (20 CPU cores total), and uses an 8-channel DDR4-1866 memory configuration.

Sanmayce · June 26, 2015

> With AIDA64 v2.70 we revamped the Photoworxx

Can we say that since v2.70 we can compare those Mpixels/s safely? For example, one fellow member on the overclock.net forum shared this:

I mean he uses only 6cores/6threads (4-channel DDR3-2133) and reaches 31921 half the performance of 20cores/40threads (Max Turbo Frequency 3.3 GHz) (8-channel DDR4-1866) 62186.

My dummy calculation is 6x4.7GHz=28GHz while 40x3.3GHz=132GHz, why the speedup is 2x and not 132/28=4.7x?

I thought that speedup is linear with adding more threads, is it because you reached the maximum bandwidth? Are those 62186 equal 60GB/s RAM throughput?

Fiery · June 28, 2015

> Can we say that since v2.70 we can compare those Mpixels/s safely?

Yes, we can.

> For example, one fellow member on the overclock.net forum shared this:
>
> I mean he uses only 6cores/6threads (4-channel DDR3-2133) and reaches 31921 half the performance of 20cores/40threads (Max Turbo Frequency 3.3 GHz) (8-channel DDR4-1866) 62186.
>
> My dummy calculation is 6x4.7GHz=28GHz while 40x3.3GHz=132GHz, why the speedup is 2x and not 132/28=4.7x?
>
> I thought that speedup is linear with adding more threads, is it because you reached the maximum bandwidth? Are those 62186 equal 60GB/s RAM throughput?

That 20-core test system of ours is bottlenecked by the memory subsystem. Even though the 20-core system has got let's say 4.7x more processing power, it's only got 1.75x more memory bandwidth than the system your screen shot was made on. Speedup is only linear in such benchmarks that do not rely heavily on the memory subsystem, like FPU Julia.
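
The ratios in this exchange can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it uses theoretical peak bandwidth (64-bit channels, transfer rate taken from the DDR speed grade), not measured throughput, and the function name is illustrative.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak memory bandwidth in GB/s (10^9 bytes/s).

    Assumes one 64-bit (8-byte) transfer per channel per transfer cycle.
    """
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9


six_core_bw = peak_bandwidth_gbs(4, 2133)   # 4-channel DDR3-2133
dual_xeon_bw = peak_bandwidth_gbs(8, 1866)  # 8-channel DDR4-1866

bandwidth_ratio = dual_xeon_bw / six_core_bw
score_ratio = 62186 / 31921                 # PhotoWorxx MPixel/s scores
clock_ratio = (40 * 3.3) / (6 * 4.7)        # threads x GHz, as in the question

if __name__ == "__main__":
    print(f"peak bandwidth: {six_core_bw:.1f} vs {dual_xeon_bw:.1f} GB/s")
    print(f"bandwidth ratio: {bandwidth_ratio:.2f}x")
    print(f"score ratio:     {score_ratio:.2f}x")
    print(f"clock ratio:     {clock_ratio:.2f}x")
```

The score ratio lands far closer to the bandwidth ratio than to the threads-times-clock ratio, which is what one would expect from a benchmark limited by the memory subsystem.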

Sanmayce · June 28, 2015

Thanks for the clarification.

My next machine will have AMD Zen if they manage to offer 32threads and outperform 5960x (in my 16-threaded_Kazahana_vs_Wikipedia_fuzzy_search_torture).

Personally, I find the RAM torturing (random accesses beyond LLC) by all available threads very inspirational - it feels like exploring the view while walking on the edge of a cliff - when one is all-attention.

I fear that AMD Zen will face similar to your 40-threaded system RAM bottleneck.

Fiery · June 28, 2015

> Thanks for the clarification.
>
> My next machine will have AMD Zen if they manage to offer 32threads and outperform 5960x (in my 16-threaded_Kazahana_vs_Wikipedia_fuzzy_search_torture).
>
> Personally, I find the RAM torturing (random accesses beyond LLC) by all available threads very inspirational - it feels like exploring the view while walking on the edge of a cliff - when one is all-attention.
>
> I fear that AMD Zen will face similar to your 40-threaded system RAM bottleneck.

Zen will have a hard time competing against Broadwell-EP, which will also debut 2 or 3 quarters before Zen. We'll see.
