The cache statistics from the 2 MiB and 4 MiB simulations are given below. The number of L3 misses is only slightly higher with the 4 MiB L3 than with the 2 MiB L3. The difference of 15 misses (a 0.064% increase) does not seem large enough to account for the difference in run time (about 1.31 ms, or 10%). The data also show that the number of accesses to the L3 dropped noticeably with the increase in cache size (by 9245 accesses, or 4.4%). This would suggest that the number of L2 misses has decreased, which is indeed the case (by 4517 misses, or 3.6%). This is a bit counterintuitive, as it seems that a change in the L3 size should not affect L2 activity. Likewise, the L1 instruction cache activity changes as well, in both accesses and misses. The only quantity that appears to remain constant is the number of accesses to the L1 data cache.
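From a question I asked on the Sniper mailing list in May, it has become clear that the way cache stats are counted is not as simple as one might first imagine. For instance, in both runs below the number of L3 accesses is well above the number of L2 misses, so the counters at one level do not map one-to-one onto the misses of the level above. It is important to keep this in mind when reviewing the data.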
There is a post on the Sniper mailing list regarding a similar problem in which someone observed decreasing performance with increasing L2 size. However, this case involved a parallel application. According to that discussion, it seems that parallelism was a root cause of the unexpected behavior. In my case, however, I am running a single, sequential application.
2 MiB L3:
Execution Time (ns) 13050531
Cache L1-I
num cache accesses 3430983
num cache misses 3105
Cache L1-D
num cache accesses 14697313
num cache misses 134331
Cache L2
num cache accesses 228931
num cache misses 126219
Cache L3
num cache accesses 211220
num cache misses 23432
4 MiB L3:
Execution Time (ns) 14356859
Cache L1-I
num cache accesses 3440668
num cache misses 3337
Cache L1-D
num cache accesses 14697313
num cache misses 134352
Cache L2
num cache accesses 229190
num cache misses 121702
Cache L3
num cache accesses 201975
num cache misses 23447
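As a sanity check, the differences between the two runs can be recomputed from the listings above with a few lines of Python. This is only a sketch: the dictionary keys and the `change` helper are names chosen here for illustration and are not anything Sniper itself produces.

```python
# Counters copied by hand from the 2 MiB and 4 MiB listings above.
# The key names are illustrative, not Sniper's own stat names.
stats_2mib = {
    "time_ns": 13050531,
    "l1i_accesses": 3430983, "l1i_misses": 3105,
    "l1d_accesses": 14697313, "l1d_misses": 134331,
    "l2_accesses": 228931, "l2_misses": 126219,
    "l3_accesses": 211220, "l3_misses": 23432,
}
stats_4mib = {
    "time_ns": 14356859,
    "l1i_accesses": 3440668, "l1i_misses": 3337,
    "l1d_accesses": 14697313, "l1d_misses": 134352,
    "l2_accesses": 229190, "l2_misses": 121702,
    "l3_accesses": 201975, "l3_misses": 23447,
}


def change(before, after):
    """Return (absolute difference, percent change relative to `before`)."""
    diff = after - before
    return diff, 100.0 * diff / before


# Print the change in every counter going from the 2 MiB to the 4 MiB run.
for key in stats_2mib:
    diff, pct = change(stats_2mib[key], stats_4mib[key])
    print(f"{key:14s} {diff:+9d} {pct:+8.3f}%")
```

With these inputs the L2 miss difference comes out negative, i.e. the 4 MiB run has fewer L2 misses than the 2 MiB run, while the L1 data cache access count is unchanged.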
Update (3 July): I received a response from one of the Sniper developers, who recommended setting the "traceinput/address_randomization" configuration parameter to "false". I did so, and the new results show the same execution time for all cases where the L3 cache size was larger than 1 MiB. The new results are shown in the figure below.