
i7-5775C L4 cache performance



Those 128MB of L4 are interesting; my goal is to see how much L4 helps compression, and especially decompression, when using big blocks.

Intel_Broadwell_diags6.png

 

These guys showed that it helps, but they didn't say what the dictionary size was or which method they used.
Now that L4 has been introduced, I wonder whether AIDA64 is ready to show a similar scenario. Zlib uses only a 32KB window, so is there any plan to add a heavy compressor to the benchmark roster?
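
To illustrate why a 32KB window keeps Zlib from exercising a large cache, here is a minimal Python sketch. It uses Python's standard zlib and lzma modules purely as stand-ins (AIDA64's own benchmark code is not shown in this thread), and the 64KB block and 1MB dictionary are arbitrary sizes picked for the illustration.

```python
import lzma
import random
import zlib

# Illustrative data: one 64 KB block of pseudo-random bytes, repeated twice.
# The only redundancy is a repeat at a distance of 64 KB, which lies
# outside Deflate's 32 KB window.
rng = random.Random(0)
block = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
data = block + block

# zlib (Deflate): wbits=15 selects the maximum 2**15 = 32 KB window.
deflater = zlib.compressobj(9, zlib.DEFLATED, 15)
zlib_size = len(deflater.compress(data) + deflater.flush())

# LZMA2 with a 1 MB dictionary (a hypothetical size chosen for this sketch).
filters = [{"id": lzma.FILTER_LZMA2, "dict_size": 1 << 20}]
lzma_size = len(lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters))

print("input:", len(data), "zlib:", zlib_size, "lzma:", lzma_size)
```

The Deflate output should stay close to the input size because the repeat is out of its reach, while the larger dictionary can reference the first copy. The same reasoning suggests that only a compressor with a dictionary in the tens or hundreds of MB would touch memory beyond the lower cache levels.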
 


Allow me to suggest one additional benchmark covering one of the most needed single-threaded tasks: sorting.
Whether it is a sorting algorithm or an actual compressor using BWT, I see no big difference.

I found this thread giving some flavor of what we can expect:

http://techreport.com/forums/viewtopic.php?f=2&t=93911#p1206147
 

Code Name            Product                  Cores  Threads  Clocks    Power   Release Date   Notes
Crystal Lake?        Core i7 Extreme Edition  4      16       3.4 GHz   95W     2016Q2         New socket. SoC. Dual channel DDR4 and 24 PCIe 4.0 lanes, 256 MB of eDRAM, unlocked
Crystal Lake refresh Core i7 Extreme Edition  4      16       3.5 GHz   95W     2016Q4         New socket. SoC. Dual channel DDR4 and 24 PCIe 4.0 lanes, 256 MB of eDRAM, unlocked
Cannon Lake          Core i7 Extreme Edition  6      24       3.6 GHz   87W     2017Q3         New socket. SoC. Dual channel DDR4 and 24 PCIe 4.0 lanes, 256 MB of eDRAM, unlocked

Now we have 128MB, and next year hopefully 256MB of L4; how are we going to estimate the role of L4?


Hi guys,

can anyone clarify the huge latency (as I see it) of the L4 cache measured on that system:

 

My naive expectation was 20ns, so why 42ns?

In my view L4 performs poorly: only 20ns better than RAM.

That is normal, and it's much better than the first generation eDRAM solution. Core i7-4770R: L4 cache latency = 76.2 ns.
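
As a rough way to read these latency numbers, the sketch below models the average latency of accesses that miss L3 as a weighted mix of L4 and RAM latency. The 42 ns figure comes from the question above; the 62 ns RAM latency is only inferred from the "20ns better than RAM" remark, the hit rates are hypothetical, and the model assumes an L4 miss costs just the RAM latency.

```python
def effective_latency_ns(l4_hit_rate, l4_ns=42.0, ram_ns=62.0):
    """Average latency of accesses that miss L3, for a given L4 hit rate.

    Simplifying assumption: an L4 miss costs only the RAM latency,
    with no extra penalty for the failed L4 lookup.
    """
    if not 0.0 <= l4_hit_rate <= 1.0:
        raise ValueError("hit rate must be between 0 and 1")
    return l4_hit_rate * l4_ns + (1.0 - l4_hit_rate) * ram_ns


# Hypothetical hit rates, chosen only to show the trend.
for rate in (0.0, 0.5, 1.0):
    print(f"L4 hit rate {rate:.0%}: {effective_latency_ns(rate):.1f} ns")
```

Under this model the gain is bounded by the 20 ns gap, so a latency-bound workload only benefits in proportion to how much of its working set actually stays in L4.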

Our ZLib data compression benchmark uses the "best" compression method, so it's more CPU bound than memory/cache interface bound. It is not designed to show off the underlying cache architecture performance.

Our PhotoWorxx image manipulation benchmark should better reflect the performance gains you can get from the L4 cache, but you may need to limit the number of used CPU cores to 3 or 4 if you really want to see a big difference between Haswell and Broadwell-H.

 

Allow me to suggest one additional benchmark covering one of the most needed single-threaded tasks: sorting.

Whether it is a sorting algorithm or an actual compressor using BWT, I see no big difference.

I'm afraid we have no plans to implement a sorting benchmark.

 

Now we have 128MB, and next year hopefully 256MB of L4; how are we going to estimate the role of L4?

AIDA64 benchmarks are not designed to show off the performance gain coming from a particular hardware feature, so we will not develop a new benchmark method just to show how great a 128MB or 256MB L4 is. In general (non-gaming) use, the L4 cache doesn't actually provide huge gains. BTW, some rumours state that with Skylake and Cannonlake the eDRAM will no longer act as an L4 cache, but rather as a dedicated buffer (cache) for the iGPU. I'm not sure if that's true; it would be surprising if Intel made such a move.


... It is not designed to show off the underlying cache architecture performance.

That's what I had as a suggestion.

Simply put, I wanted to see a single number reported by AIDA that gives an overall impression of how the tested machine behaves in an intensive real-world integer scenario involving all caches.


For that purpose the AIDA64 PhotoWorxx benchmark may be the solution we can offer at this time. It is heavily multi-threaded, it uses integer code, and it utilizes all cache levels.


I see, it really serves well; in a way it is even better than sorting. I myself wrote a picture rotator/viewer called 'Otane' in 16-bit assembly. It very heavily stresses the RAM block that houses the matrix.
By the way, is your picture big enough to stress L4? This month, with the help of some fellows who own a 5775C, I want to see the impact of L4; it interests me a lot compression-wise. I intend to stress a 256MB block with BWT/LZ decompression. I will ask them to run PhotoWorxx and will compare the results with the decompression ones.
 

59452634eade4b66.jpg


Yes, the picture size used by the PhotoWorxx benchmark means that up to 384MB of memory is utilized on an 8-thread CPU, so it's good enough to stress the L4 cache, even when its size is increased to 256MB in the future.
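
A back-of-the-envelope check of that figure, as a minimal sketch: it assumes the 384MB is split evenly across the 8 threads, which the reply does not actually state, and the function name is made up for this illustration.

```python
def threads_fitting_in_l4(l4_mb, total_mb=384, threads=8):
    """Number of per-thread working sets that fit in an L4 of l4_mb.

    Assumption (not confirmed in the thread): the total footprint is
    split evenly, i.e. 384 MB / 8 threads = 48 MB per thread.
    """
    per_thread_mb = total_mb / threads
    return int(l4_mb // per_thread_mb)


for l4_mb in (128, 256):
    print(f"{l4_mb} MB L4 holds {threads_fitting_in_l4(l4_mb)} thread working sets")
```

If the even-split assumption holds, the total footprint exceeds both the current 128MB and a future 256MB L4, which is consistent with the claim that the benchmark is large enough to stress the cache.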


Thanks, just one more question. According to my observation, I see a 2x speed increase coming only from the doubled RAM bandwidth; is that so? I didn't expect such a huge impact from bandwidth alone, as I thought the main bottleneck would be latency. Very strange (I see 12 GigaPixels/s, which may be 12GB/s or 36GB/s). Could you clarify why that happens? Is PhotoWorxx bandwidth bound?
 

AIDA.jpg


It depends on the actual bandwidth and latency performance of the CPU, but generally speaking, PhotoWorxx is more bandwidth bound than latency bound. And about your scores: the 2x increase is not really 2x, and the bandwidth difference isn't actually 2x either :) The reference FX-8350 system has got 2.3x more bandwidth than your tested system, and it obtained 1.84x of your score.
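
The two ratios quoted in that reply can be put side by side with a small sketch. The variable names are illustrative, and "scaling efficiency" is just a label used here for score gain divided by bandwidth gain.

```python
# Ratios quoted in the reply above (reference FX-8350 system vs tested system).
bandwidth_ratio = 2.3   # memory bandwidth advantage
score_ratio = 1.84      # PhotoWorxx score advantage

# Fraction of the extra bandwidth that shows up as extra score.
scaling_efficiency = score_ratio / bandwidth_ratio

print(f"scaling efficiency: {scaling_efficiency:.0%}")
```

A value well above one half but below one fits the description "more bandwidth bound than latency bound": the score follows bandwidth closely, though not perfectly.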


  • 2 weeks later...

Hi again,

can you post a snapshot of the highest PhotoWorxx scores you know of?
 

I searched and found this site:

http://amdfx.blogspot.com/2012/01/aida64-benchmarks-windows-7-fx-patch.html

 

I found 94,910:

 

CPU-photoworx.png

Also, please explain what causes such a suspiciously low score on the 5775C (here):

AIDA64-CPU-Photoworxx3.jpg

 

AMD 'Zambezi' appears to be 3x faster; something here does not seem right.


One of our server test systems can push 62186 MPixel/s in the current PhotoWorxx benchmark. The system is based on two Xeon E5-2660 v3 "Haswell-EP" processors (20 CPU cores total), and uses an 8-channel DDR4-1866 memory configuration.


 

Can we say that since v2.70 we can compare those MPixel/s scores safely? For example, a fellow member on the overclock.net forum shared this:

 

b872e89a_ScreenShot003.jpeg

 

I mean he uses only 6 cores/6 threads (4-channel DDR3-2133) and reaches 31921, about half the 62186 of the 20 cores/40 threads system (Max Turbo Frequency 3.3 GHz, 8-channel DDR4-1866).

My naive calculation is 6x4.7GHz=28GHz versus 40x3.3GHz=132GHz, so why is the speedup 2x and not 132/28=4.7x?

I thought speedup was linear with added threads; is it because you reached the maximum bandwidth? Do those 62186 correspond to 60GB/s of RAM throughput?


Can we say that since v2.70 we can compare those Mpixels/s safely?

Yes, we can.

 


That 20-core test system of ours is bottlenecked by the memory subsystem. Even though the 20-core system has, let's say, 4.7x more processing power, it has only 1.75x more memory bandwidth than the system your screenshot was made on. Speedup is linear only in benchmarks that do not rely heavily on the memory subsystem, like FPU Julia.
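
The 1.75x figure can be reproduced from the memory configurations mentioned in the thread. This is a minimal sketch assuming theoretical peak bandwidth (channels x 8 bytes x transfer rate) and the thread-count x clock product from the question as a crude proxy for processing power; measured bandwidth will be lower, and the function name is made up for this illustration.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak DRAM bandwidth in GB/s (64-bit channels assumed)."""
    return channels * bus_bytes * mega_transfers_per_s / 1000.0


desktop_bw = peak_bandwidth_gbs(4, 2133)  # 4-channel DDR3-2133
server_bw = peak_bandwidth_gbs(8, 1866)   # 8-channel DDR4-1866

bandwidth_ratio = server_bw / desktop_bw
compute_ratio = (40 * 3.3) / (6 * 4.7)    # threads x GHz, as in the question
score_ratio = 62186 / 31921               # PhotoWorxx MPixel/s

print(f"peak bandwidth: {desktop_bw:.1f} vs {server_bw:.1f} GB/s")
print(f"bandwidth ratio {bandwidth_ratio:.2f}, compute ratio {compute_ratio:.2f}, "
      f"score ratio {score_ratio:.2f}")
```

The observed score ratio sits close to the bandwidth ratio and far from the compute ratio, which is what a memory-bound benchmark would be expected to show. The small excess over 1.75x is not explained by this model; theoretical peaks are only an approximation of what each system sustains.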


Thanks for the clarification.

My next machine will have AMD Zen if they manage to offer 32 threads and outperform the 5960X (in my 16-threaded_Kazahana_vs_Wikipedia_fuzzy_search_torture).

Personally, I find the RAM torturing (random accesses beyond LLC) by all available threads very inspirational - it feels like exploring the view while walking on the edge of a cliff - when one is all-attention.

 

I fear that AMD Zen will face a RAM bottleneck similar to that of your 40-threaded system.


Zen will have a hard time competing against Broadwell-EP, which will also debut 2 or 3 quarters before Zen. We'll see ;)
