Saturday, February 28, 2015

Automatic Cache Prefetching

Many optimizations take the form of "lazy evaluation."  When an operation may end up not being needed, we defer it as long as possible for the chance of avoiding it entirely. However, there is another class of optimizations which attempt to do work speculatively before we know that we need to. At first blush, it seems like there are a few conditions where it might be advantageous to do this:

 1. If the operation has a lot of latency.
 2. If the resource that is needed for the operation tends to be underutilized.
 3. The operation ends up being needed most of the time anyway.

I've experimented with automatically prefetching cache lines to improve performance in my GPGPU design. This is a common technique in many computer architectures. When a cache miss occurs, it will automatically read a few of following cache lines.


I've implemented an the prefetch mechanism in the L2 cache for simplicity.  This means that accesses to prefetched lines may still cause an L1 miss. However, since L1 misses have relatively low latency compared to external memory accesses, this shouldn't matter much. I believe the following analysis supports this conclusion.

The first stage in the L2 cache pipeline is a convenient place to put the prefetch mechanism: it can snoop cache misses as they flow by, and can inject memory requests when no other requests are pending.

The two bolded chunks of SystemVerilog code below represent the prefetch logic that has been added to l2_cache_arb. The first chunk detects when prefetching should be started after a cache miss.  If another cache miss occurs while a prefetch is already active (which is not uncommon), it will be ignored.

The second chunk injects prefetch requests into the pipeline. This clause is only active if there aren't requests from cores or restarted fills. Prefetches are opportunistic, so this ensures they don't starve normal requests.

    if (l2bi_ready)
        // Restarted request from external bus interface

       if (!prefetch_active && l2bi_request.packet_type != L2REQ_PREFETCH)
            prefetch_active <= 1;
            prefetch_address <= l2bi_request.address + `CACHE_LINE_BYTES;
    else if (|grant_oh && can_accept_request)
        // New request from a core
    else if (prefetch_active && can_accept_request)
        // Inject prefetch commands
        prefetch_address <= prefetch_address + `CACHE_LINE_BYTES;
        prefetch_count <= prefetch_count + 1;
        if (prefetch_count == PREFETCH_COUNT - 1)
            prefetch_active <= 0;

        l2a_request.valid <= 1; 
        l2a_request.packet_type <= L2REQ_PREFETCH;
        l2a_request.address <= prefetch_address;
        l2a_is_l2_fill <= 0;


Being an optimization, this is the sort of feature that can appear to work correctly, but not actually be doing anything. After running regression tests to make sure this change didn't break normal functionality, I attempted to prove that this was indeed prefetching as intended.  My first step was to instrument the code with some $display statements.  In the output, I can see the prefetches being injected and completing:

filled cache miss 00200000
inserting prefetch 00200040
inserting prefetch 00200080
inserting prefetch 002000c0
inserting prefetch 00200100
filled prefetch 00200040
filled prefetch 00200080
filled prefetch 002000c0
filled prefetch 00200100

This looks correct.  Next, I ran a single threaded memory read test.  Here are the results:

No PrefetchPrefetch
Total Cycles846,091574,293
l2 miss16,3904,688
l2 hit314,317

This looks correct.  Prefetching makes this program run 32% faster (looking at total cycles). The L2 cache miss rate has decreased from 99.98% to 24%.

Things Get More Complicated

This design is intended to run programs that are heavily multithreaded, so the case above is not representative of typical workloads. The next step is to run the multithreaded read benchmark. This test uses four hardware threads to read interleaved chunks of memory, and uses vector loads to read an entire cache line at a time.

No PrefetchPrefetch
Total Cycles459,914467,098
l2 miss16,3949,005
l2 hit58,996

The L2 miss rate has decreased with prefetch enabled, which is what we would expect.  However, this is a dead heat, with the prefetch being about 1.5% slower.  Based on previous posts, that shouldn't be a surprise.  Hardware multithreading is already hiding latency, so the prefetch mechanism is redundant.

The next case I tried is the write benchmark.  It uses vector wide writes and multiple threads as above.

No PrefetchPrefetch
Total Cycles462,041721,704
l2 miss16,3809,222
l2 hit28,488

Although the L2 cache miss rate has decreased here as well, the performance is around 55% slower. This occurs because of a bad interaction with a hardware optimization.

When a cache miss occurs writing a 32-bit word to memory, the hardware first loads the entire cache line (64 bytes) into the L2 cache. It then updates the 32-bit word that was written in that cache line and marks the line dirty.  It will be written back to memory when that line needs to be evicted to make room for other lines. However, this architecture allows an entire vector to be stored at an aligned address with a single instruction. Vector registers are the same size as a cache line (this is not a coincidence).  In the case where a vector write misses the cache, loading the line from memory first would be a waste, since it's going to be entirely overwritten anyway. There is logic to detect this condition and skip the memory load. The write benchmark is using vector writes.

However, automatic cache prefetching sabotages this optimization by prefetching those lines anyway. This is demonstrated by adding a little Verilog code to count the number of external memory references:

No PrefetchPrefetch
External Memory Reads58453

A More Realistic Workload

The tests so far show that the performance of this feature varies wildly.  In the best case, it improved performance by 32%, but in the worst, it degraded it by 55%.  It also seems to be worse with more highly optimized code.  However, these are all very specific microbenchmarks.  Let's grab some numbers from a more mixed test case, the 3D rendered textured cube from previous posts:

No PrefetchPrefetch
Total Cycles17,808,76120,742,364
l2 miss102,05863,684
l2 hit307,285348,439

As before, prefetch reduces the l2 miss rate, but the performance hit is pretty significant: 16% slower with prefetch enabled.


Overall, this feature seems to be detrimental to performance. Why? I initially speculated that there are three rules that define when speculative work is beneficial.  Rule 1 states there needs to be high latency. However, one could argue that hardware threading has already hidden this latency and thus it is no longer true.  We didn't really measure memory interface utilization in this test, although it's probably safe to infer that hardware multithreading also improves memory utilization. Rule 3 states that the operation should occur most of the time anyway. In the most degenerate benchmarks I ran, this is not true: the vector write optimization skips the operation often in some cases.

There are other experiments I could have run.  For example, it might be interesting to disable the vector write optimization to see if there was a difference between prefetch and no-prefetch modes--which would support the theory that the interaction with that optimization is primarily penalizing the mixed use case. I also could have changed the number of prefetched cache lines to see if that made a difference.  However, the basic numbers seemed bad enough that it's probably not worth it.


  1. What about a post modify load instruction?
    It works like this:
    load reg,base,imm
    It loads memory pointed by base into reg and then adds imm to base
    It can be used in situations like this
    This instruction can save an additional add and the cpu can safely prefetch data because it used mostly when you access data linearly

    1. EDIT:this example of use is stores data

    2. Yeah, but I see what you're getting at. I had thought about automatic post increment of pointers (like ARM), but using it to do prefetch is an interesting idea that I hadn't considered.

      The reason I originally didn't implement automatic pointer increment is that I couldn't figure out how to do it without an extra write port on the register file (in the case of a load). It would be interesting to do some modeling with automatic post-increment instructions to see the overall impact on performance. This would probably help in terms of total instruction count, but I'd predict that the improvement from the prefetch would still be masked by hardware threading.

      Another experiment to add to the list. :)

  2. As I understand the code above prefetches the next cache line which is not far away enough.Can you modify it to prefetch something like 4 cache lines away?

    1. It prefetches the next four contiguous cache lines. Were you proposing fetching more cache lines, or skipping some number of cache lines?

    2. I think there is no real benefit from prefetching the next cache line.So maybe prefetch the forth or eighth (skipping the previous) cache line.

    3. Why do you think there is no benefit for prefetching the next cache line?

    4. Actually the benefit is small.This is because a memory load is very high latency about 100 cycles.So in order for the data to be in the l2 cache when they are needed,the cpu must start loading them many cycles in advance.If the cpu starts loading the data only few cycles before needed,it will make memory load only few cycles shorter.Also prefetching has many side effects, like filling the cache with unrelated data ,which can make the cpu slower if the benefit for doing it is small.

    5. That's a good point. I did observe L2 misses coming in for a line that was already in the process of being prefetched a number of times. It depends on the access pattern. For a memory copy, I would certainly expect this to be true. However, there are plenty of cases where adjacent accesses won't be a few cycles apart. Instruction cache accesses are an example. An instruction cache line holds 16 instructions. Assuming no branches, and all threads are ready, then each thread issues every fourth cycle, so it would take 64 cycles to execute all instructions in a cache line. Most floating point operations have 7 cycles of latency, so data dependencies will increase this. A floating point divide has close to 50 cycles of latency.

      In most cases I tested, prefetch lowered the L2 miss rate significantly, so even the naive algorithm does alright. The problem is that threads already do an effective job of hiding that latency (masking the prefetch improvements), and it ended up prefetching lines that didn't need to be loaded. I would expect the probability of that a prefetch will fetch the right line decreases as you get farther away from the location of the original fault.