Sanmayce (June 11, 2015): Hi guys, can anyone explain the huge latency (as I see it) of the L4 cache measured on that system? My naive expectation was 20 ns, so why 42 ns? In my view L4 performs poorly: only 20 ns better than RAM.
Sanmayce (June 11, 2015): Those 128MB are interesting. My goal is to see how well L4 helps compression, and mostly decompression, when using big blocks. These guys showed that it helps, but they didn't say what the dictionary size was or what method they used. With the introduction of L4, I wonder whether AIDA64 is ready to show a similar scenario. Zlib uses only 32KB, so is there any plan to add a heavy compressor to the benchmark roster?
Sanmayce (June 11, 2015): Allow me to suggest one additional benchmark showing one of the most needed single-thread tasks: sorting. Whether it is a sorting algorithm or an actual compressor using BWT, I see no big difference. I found this thread giving some flavor of what we can expect: http://techreport.com/forums/viewtopic.php?f=2&t=93911#p1206147
Code Name | Product | Cores | Threads | Clocks | Power | Release Date | Notes
Crystal Lake? | Core i7 Extreme Edition | 4 | 16 | 3.4 GHz | 95W | 2016Q2 | New socket. SoC. Dual channel DDR4 and 24 PCIe 4.0 lanes, 256 MB of eDRAM, unlocked
Crystal Lake refresh | Core i7 Extreme Edition | 4 | 16 | 3.5 GHz | 95W | 2016Q4 | New socket. SoC. Dual channel DDR4 and 24 PCIe 4.0 lanes, 256 MB of eDRAM, unlocked
Cannon Lake | Core i7 Extreme Edition | 6 | 24 | 3.6 GHz | 87W | 2017Q3 | New socket. SoC. Dual channel DDR4 and 24 PCIe 4.0 lanes, 256 MB of eDRAM, unlocked
Now we have 128MB, and next year hopefully 256MB of L4. How are we going to estimate the role of L4?
Fiery (June 12, 2015):
Sanmayce wrote: "Hi guys, can anyone clarify the huge latency (as I see it) of L4 cache obtained on that system: My dummy expectations were for 20ns, why 42ns? In my view L4 performs poorly - only 20ns better than RAM."
That is normal, and it's much better than the first-generation eDRAM solution. Core i7-4770R: L4 cache latency = 76.2 ns.
Our ZLib data compression benchmark uses the "best" compression method, so it's more CPU bound than memory/cache interface bound. It is not designed to show off the underlying cache architecture performance. Our PhotoWorxx image manipulation benchmark should better reflect the performance gains you can get from the L4 cache, but you may need to limit the number of CPU cores used to 3 or 4 if you really want to see a big difference between Haswell and Broadwell-H.
Sanmayce wrote: "Allow me to suggest one additional benchmark showing one of the most needed tasks in single-thread - sorting. Whether a sorting algorithm or an actual compressor using BWT I see no big difference."
I'm afraid we have no plans to implement a sorting benchmark.
Sanmayce wrote: "Now, we have 128MB, next year hopefully 256MB L4, how are we gonna estimate the role of L4?"
AIDA64 benchmarks are not designed to show off the performance gain coming from a particular hardware feature, so we will not develop a new benchmark method just to show off how great 128MB or 256MB of L4 is. And actually, in general (non-gaming) use the L4 cache doesn't provide huge gains.
BTW, some rumours state that with Skylake and Cannonlake the eDRAM will no longer act as an L4 cache, but more as a dedicated buffer (cache) for the iGPU. I'm not sure if that's true; it would be surprising if Intel made such a move.
Sanmayce (June 12, 2015):
Fiery wrote: "... It is not designed to show off the underlying cache architecture performance."
That's what I had in mind with my suggestion. Simply put, I wanted to see a single number reported by AIDA that gives an overall impression of how the tested machine behaves in an intensive REALWORLD all-caches-involved integer scenario.
Fiery (June 12, 2015):
Sanmayce wrote: "That's what I had as a suggestion. Simply, I wanted to see a single number reported by AIDA giving some overall impression on how the tested machine behaves in intensive REALWORLD all-caches-involved integer scenario."
For that purpose the AIDA64 PhotoWorxx benchmark may be the solution we can offer at this time. It is heavily multi-threaded, it uses integer code, and it utilizes all cache levels.
Sanmayce (June 12, 2015): I see, it really serves well; in a way it is even better than sorting. I myself wrote a picture rotator/viewer in 16-bit assembly called 'Otane'. It very heavily stresses the RAM block needed to house the matrix. By the way, is your picture big enough to stress L4? This month, with the help of some fellows who own a 5775C, I want to see the impact of L4, which interests me a lot compression-wise. I intend to stress a 256MB block with BWT/LZ decompression. I will ask them to run PhotoWorxx and will compare the results with the decompression ones.
Fiery (June 12, 2015): Yes, the picture size used by the PhotoWorxx benchmark means that up to 384MB of memory is utilized on an 8-thread CPU, so it's good enough to stress the L4 cache, even when its size is increased to 256MB in the future.
Sanmayce (June 12, 2015): Thanks, just one more question. According to my observation, I see a 2x speed increase coming only from the doubled RAM bandwidth. Is that so? I didn't expect such a huge impact from bandwidth alone; I thought the main bottleneck would be latency. Very strange (I see 12 GigaPixels/s, which may be 12GB/s or 36GB/s). Could you clarify why that happens? Is PhotoWorxx bandwidth bound?
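The "12GB/s or 36GB/s" guess in the post above comes down to how many bytes each pixel occupies, which the thread does not state. A minimal Python sketch of that conversion follows. `bytes_per_pixel` is a hypothetical parameter (1 and 3 are simply the two cases implied by the question), and the sketch ignores that a benchmark may read and write each pixel more than once, so it is not a statement about how PhotoWorxx actually stores pixels.

```python
def mpixels_to_gb_per_s(mpixels_per_s: float, bytes_per_pixel: int) -> float:
    """Convert a pixel rate in MPixel/s to a data rate in GB/s (decimal units)."""
    bytes_per_s = mpixels_per_s * 1e6 * bytes_per_pixel
    return bytes_per_s / 1e9


# 12 GigaPixels/s expressed in MPixel/s, as quoted in the post.
rate_mpixels = 12_000

# bytes_per_pixel values are assumptions for illustration only.
for bpp in (1, 3):
    gb_s = mpixels_to_gb_per_s(rate_mpixels, bpp)
    print(f"{bpp} byte(s) per pixel -> {gb_s:.0f} GB/s")
```

With 1 byte per pixel the rate maps to 12 GB/s and with 3 bytes per pixel to 36 GB/s, which are the two figures mentioned in the question.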
Fiery (June 13, 2015): It depends on the actual bandwidth and latency performance of the CPU, but generally speaking, PhotoWorxx is more bandwidth bound than latency bound. And about your scores: the 2x increase is not really 2x, and the bandwidth difference isn't actually 2x either. The reference FX-8350 system has 2.3x more bandwidth than your tested system, and it obtained 1.84x your score.
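The two ratios quoted in the post above can be combined into a rough scaling figure. The following Python sketch uses only the numbers from the post; the helper name `scaling_efficiency` is made up for illustration and is not an AIDA64 metric.

```python
def scaling_efficiency(score_ratio: float, bandwidth_ratio: float) -> float:
    """Fraction of the extra memory bandwidth that shows up as extra score."""
    return score_ratio / bandwidth_ratio


bandwidth_ratio = 2.3  # reference FX-8350 bandwidth vs. the tested system (from the post)
score_ratio = 1.84     # reference PhotoWorxx score vs. the tested system (from the post)

efficiency = scaling_efficiency(score_ratio, bandwidth_ratio)
print(f"Score grew by {efficiency:.0%} of the bandwidth increase")
```

A value below 100% is consistent with a benchmark that is mostly, but not purely, bandwidth bound.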
Sanmayce (June 25, 2015): Hi again, can you post a snapshot of the highest PhotoWorxx scores you know of? I searched and found this site: http://amdfx.blogspot.com/2012/01/aida64-benchmarks-windows-7-fx-patch.html where I found 94,910. Also, please explain what causes such a suspiciously low score on the 5775C (here). AMD 'Zambezi' appears to be 3x faster, so something here does not seem right.
Sanmayce (June 25, 2015): Ouch, I just saw that the above screenshots show, for example, a result of 20352 for the 4x Phenom II in one AIDA64 EE, while the other AIDA64 EE shows 5634 Mpixels/s for the same computer. I didn't know that two different metrics were in use. How so?
Balala (June 26, 2015): With AIDA64 v2.70 we revamped the Photoworxx
Fiery (June 26, 2015): One of our server test systems can push 62186 MPixel/s in the current PhotoWorxx benchmark. The system is based on two Xeon E5-2660 v3 "Haswell-EP" processors (20 CPU cores total) and uses an 8-channel DDR4-1866 memory configuration.
Sanmayce (June 26, 2015):
Balala wrote: "With AIDA64 v2.70 we revamped the Photoworxx"
Can we say that since v2.70 we can compare those Mpixels/s safely? For example, one fellow member on the overclock.net forum shared this: he uses only 6 cores/6 threads (4-channel DDR3-2133) and reaches 31921, half the performance of the 20 cores/40 threads (Max Turbo Frequency 3.3 GHz, 8-channel DDR4-1866) system at 62186. My naive calculation is 6 x 4.7GHz = 28GHz versus 40 x 3.3GHz = 132GHz, so why is the speedup 2x and not 132/28 = 4.7x? I thought that speedup is linear when adding more threads. Is it because you reached the maximum bandwidth? Do those 62186 equal 60GB/s of RAM throughput?
Fiery (June 28, 2015):
Sanmayce wrote: "Can we say that since v2.70 we can compare those Mpixels/s safely?"
Yes, we can.
Sanmayce wrote: "For example, one fellow member on overclock.net forum shared this: I mean he uses only 6cores/6threads (4-channel DDR3-2133) and reaches 31921 half the performance of 20cores/40threads (Max Turbo Frequency 3.3 GHz) (8-channel DDR4-1866) 62186. My dummy calculation is 6x4.7GHz=28GHz while 40x3.3GHz=132GHz, why the speedup is 2x and not 132/28=4.7x? I thought that speedup is linear with adding more threads, is it because you reached the maximum bandwidth? Are those 62186 equal 60GB/s RAM throughput?"
That 20-core test system of ours is bottlenecked by the memory subsystem. Even though the 20-core system has, let's say, 4.7x more processing power, it only has 1.75x more memory bandwidth than the system your screenshot was made on. Speedup is linear only in benchmarks that do not rely heavily on the memory subsystem, like FPU Julia.
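The 4.7x and 1.75x figures in this exchange can be reproduced from the numbers quoted in the thread. The Python sketch below assumes the theoretical peak of a standard 64-bit DDR channel (8 bytes per transfer) and uses the same "threads x GHz" shortcut as the question, which is only a crude proxy for processing power. The dictionary keys and helper names are made up for illustration.

```python
BYTES_PER_TRANSFER = 8  # assumption: one 64-bit DDR channel moves 8 bytes per transfer


def peak_bandwidth_gb_s(channels: int, mega_transfers_per_s: int) -> float:
    """Theoretical peak memory bandwidth in GB/s (decimal units)."""
    return channels * mega_transfers_per_s * 1e6 * BYTES_PER_TRANSFER / 1e9


def aggregate_ghz(threads: int, ghz: float) -> float:
    """Naive 'threads x clock' compute proxy used in the question."""
    return threads * ghz


# Figures as quoted in the thread.
six_core = {"threads": 6, "ghz": 4.7, "channels": 4, "mts": 2133, "score": 31921}
dual_xeon = {"threads": 40, "ghz": 3.3, "channels": 8, "mts": 1866, "score": 62186}

compute_ratio = (aggregate_ghz(dual_xeon["threads"], dual_xeon["ghz"])
                 / aggregate_ghz(six_core["threads"], six_core["ghz"]))
bandwidth_ratio = (peak_bandwidth_gb_s(dual_xeon["channels"], dual_xeon["mts"])
                   / peak_bandwidth_gb_s(six_core["channels"], six_core["mts"]))
score_ratio = dual_xeon["score"] / six_core["score"]

print(f"compute ratio:   {compute_ratio:.2f}x")
print(f"bandwidth ratio: {bandwidth_ratio:.2f}x")
print(f"score ratio:     {score_ratio:.2f}x")
```

Under these assumptions the compute ratio comes out near 4.7x, the bandwidth ratio near 1.75x, and the observed score ratio near 1.95x, so the score tracks the bandwidth ratio much more closely than the compute ratio.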
Sanmayce (June 28, 2015): Thanks for the clarification. My next machine will have AMD Zen if they manage to offer 32 threads and outperform the 5960X (in my 16-threaded_Kazahana_vs_Wikipedia_fuzzy_search_torture). Personally, I find RAM torturing (random accesses beyond LLC) by all available threads very inspirational. It feels like exploring the view while walking on the edge of a cliff, when one is all attention. I fear that AMD Zen will face a RAM bottleneck similar to that of your 40-threaded system.
Fiery (June 28, 2015):
Sanmayce wrote: "Thanks for the clarification. My next machine will have AMD Zen if they manage to offer 32threads and outperform 5960x (in my 16-threaded_Kazahana_vs_Wikipedia_fuzzy_search_torture). Personally, I find the RAM torturing (random accesses beyond LLC) by all available threads very inspirational - it feels like exploring the view while walking on the edge of a cliff - when one is all-attention. I fear that AMD Zen will face similar to your 40-threaded system RAM bottleneck."
Zen will have a hard time competing against Broadwell-EP, which will also debut 2 or 3 quarters before Zen. We'll see.