In a previous post, I observed that, as I increased the number of cores in my GPGPU, performance began to plateau and hardware threads spent more time stalled because their store queue was full. I speculated that the latter might be causing the former, although that wasn't definitive. The current implementation has only a single store queue entry per thread. One optimization I've been considering is adding more store queue entries, but this has subtle and complex design tradeoffs.
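To make the tradeoff concrete, here's a toy model of a per-thread store queue. All names and sizes are illustrative, not taken from the actual hardware design; it just shows why a single entry forces a stall on back-to-back stores:

```python
from collections import deque

class StoreQueue:
    """Toy model of a per-thread store queue (illustrative only)."""

    def __init__(self, num_entries=1):
        self.entries = deque()
        self.num_entries = num_entries

    def issue_store(self, address, value):
        """Returns True if the store was accepted; False means the
        thread must stall until an entry drains to memory."""
        if len(self.entries) >= self.num_entries:
            return False  # queue full: thread stalls
        self.entries.append((address, value))
        return True

    def drain_one(self):
        """Called when the memory system accepts the oldest store."""
        if self.entries:
            self.entries.popleft()

# With one entry, a second back-to-back store stalls; with two, it doesn't.
sq1 = StoreQueue(num_entries=1)
assert sq1.issue_store(0x100, 1)
assert not sq1.issue_store(0x104, 2)   # stall

sq2 = StoreQueue(num_entries=2)
assert sq2.issue_store(0x100, 1)
assert sq2.issue_store(0x104, 2)       # no stall
```

The subtlety in real hardware is everything this model leaves out: draining entries in order, forwarding queued values to subsequent loads from the same address, and keeping stores visible to the coherence protocol.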
Showing posts with label microarchitecture. Show all posts
Wednesday, November 30, 2016
Saturday, February 28, 2015
Automatic Cache Prefetching
Many optimizations take the form of "lazy evaluation": when an operation may end up not being needed, we defer it as long as possible in the hope of avoiding it entirely. However, there is another class of optimizations that does work speculatively, before we know whether it will be needed. At first blush, it seems like this might be advantageous under a few conditions:
1. The operation has a lot of latency.
2. The resource needed for the operation tends to be underutilized.
3. The operation ends up being needed most of the time anyway.
I've experimented with automatically prefetching cache lines to improve performance in my GPGPU design. This is a common technique in many computer architectures: when a cache miss occurs, the cache automatically reads the next few cache lines as well.
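A toy simulation shows the effect on a sequential access pattern. The cache geometry and prefetch depth here are made up for illustration, not the parameters of the actual design:

```python
class SimpleCache:
    """Toy direct-mapped cache illustrating next-line prefetch on a miss."""

    LINE_SIZE = 64  # bytes per line (assumed)

    def __init__(self, num_lines=64, prefetch_lines=0):
        self.num_lines = num_lines
        self.prefetch_lines = prefetch_lines
        self.tags = [None] * num_lines
        self.misses = 0

    def _fill(self, line_addr):
        self.tags[line_addr % self.num_lines] = line_addr

    def access(self, addr):
        line = addr // self.LINE_SIZE
        if self.tags[line % self.num_lines] != line:
            self.misses += 1
            self._fill(line)
            # Speculatively fill the next few sequential lines.
            for i in range(1, self.prefetch_lines + 1):
                self._fill(line + i)

# A sequential scan: prefetching cuts demand misses by the prefetch depth.
no_pf = SimpleCache(prefetch_lines=0)
with_pf = SimpleCache(prefetch_lines=3)
for addr in range(0, 4096, 4):
    no_pf.access(addr)
    with_pf.access(addr)
print(no_pf.misses, with_pf.misses)   # prints "64 16"
```

Of course, this only helps when accesses really are sequential; a pointer-chasing workload gains nothing and wastes memory bandwidth on lines it never touches, which is exactly the speculation tradeoff from the list above.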
Saturday, February 21, 2015
Improved 3D engine profile
I reworked the 3D renderer recently, improving performance and adding features to make it more general purpose. This included improving bin assignment to reduce overhead, tracking state transitions properly, and adding features like clipping.
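The core idea behind bin assignment can be sketched in a few lines. This is a generic bounding-box binner, not the renderer's actual code; the bin size and function names are assumptions for illustration:

```python
BIN_SIZE = 64  # pixels per bin edge (assumed)

def assign_bins(triangle):
    """Assign a triangle to every screen-space bin its bounding box
    overlaps, so each core only processes triangles relevant to its tile.
    triangle: three (x, y) vertices; returns a set of bin coordinates."""
    xs = [v[0] for v in triangle]
    ys = [v[1] for v in triangle]
    x0, x1 = int(min(xs)) // BIN_SIZE, int(max(xs)) // BIN_SIZE
    y0, y1 = int(min(ys)) // BIN_SIZE, int(max(ys)) // BIN_SIZE
    return {(bx, by) for bx in range(x0, x1 + 1)
                     for by in range(y0, y1 + 1)}

# A small triangle lands in one bin; a larger one spans several.
print(assign_bins([(10, 10), (20, 10), (15, 20)]))   # prints "{(0, 0)}"
print(len(assign_bins([(10, 10), (130, 10), (70, 130)])))
```

The bounding box overestimates coverage for long, thin triangles, which is one source of the overhead that tighter bin assignment reduces.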
Saturday, July 5, 2014
Write-back vs. write-through bake-off
In the last update, I was debating whether the complexity of a write-back cache was justified by the performance increase. I ran a little bake-off between the version 1 microarchitecture, which uses a write-through L1 cache, and the in-progress version 2, which uses a write-back L1 cache. There are many other differences between the architectures that would affect performance, but I'm going to focus on memory access patterns, which should be similar. I used the same teapot benchmark as in previous posts, as it has a nice mix of memory accesses. Both versions run a single core, so some aspects of cache coherence traffic aren't captured, but I'll get to that in a minute.
Friday, July 4, 2014
Messy Details
A problem with books and papers about computer architecture is that they gloss over messy details. The block diagram looks so simple, but, when you begin to implement it, you realize there are fundamental structural hazards, and the texts give no hints about how to handle them. Often, a subtle detail completely alters the design. I'm going to talk about a lot of details around the cache hierarchy, specifically the way the L1 and L2 caches communicate with each other, and a design issue with the current microarchitecture I am implementing.
Labels:
cache coherence,
gpgpu,
microarchitecture,
performance
Tuesday, May 27, 2014
Keeping score
I recently embarked on a complete redesign of the microarchitecture for the GPGPU I've been working on, with a major goal of increasing the clock frequency. The previous version had a maximum frequency of around 30 MHz when synthesized for my Cyclone IV FPGA board, constrained by some long combinational logic paths. One way to increase clock speed is to break stages apart, making the pipeline deeper. This, however, is not without tradeoffs: it increases the latency of each operation (in terms of clock cycles) and introduces new pipeline hazards, both of which can decrease the performance of the CPU. I've attempted to mitigate this by utilizing a technique used by a number of modern GPUs (in turn borrowed from early out-of-order microprocessors, although in this context it is used for in-order issue).
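The technique the title hints at is a scoreboard. Here's a minimal model of the idea, with the usual caveat that the actual hardware tracks more than this sketch does:

```python
class Scoreboard:
    """Toy scoreboard for an in-order pipeline: an instruction issues
    only when none of its registers have a write in flight."""

    def __init__(self):
        self.pending = set()   # registers with an in-flight write

    def can_issue(self, srcs, dest):
        # RAW hazard on sources, WAW hazard on the destination.
        return not ((set(srcs) | {dest}) & self.pending)

    def issue(self, srcs, dest):
        if not self.can_issue(srcs, dest):
            return False       # stall (or issue from another thread)
        self.pending.add(dest)
        return True

    def retire(self, dest):
        self.pending.discard(dest)

sb = Scoreboard()
assert sb.issue(srcs=[1, 2], dest=3)        # r3 = r1 + r2
assert not sb.issue(srcs=[3, 4], dest=5)    # RAW on r3: must wait
sb.retire(3)                                # r3's write completes
assert sb.issue(srcs=[3, 4], dest=5)        # now it can go
```

What makes this a good fit for a deeply pipelined, multithreaded design is the last comment: when one thread's next instruction is blocked, the issue stage can simply pick a different thread instead of bubbling the pipeline.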
Sunday, November 11, 2012
Branch Prediction
I've been designing a simple GPGPU in my spare time. I recently implemented branch prediction, but when I ran the small suite of benchmarks I had written for it, I found that it only improved performance by a few percent. This may seem a bit puzzling at first blush because the benefits of branch prediction are well known. Was there a bug in my implementation? As it turns out, the answer is no, but the reason why is interesting.
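For reference, a common branch-prediction scheme is a table of two-bit saturating counters. The post doesn't say which scheme was implemented, so this is purely illustrative:

```python
class TwoBitPredictor:
    """Table of two-bit saturating counters indexed by branch address
    (a common scheme; not necessarily the one used in the design)."""

    def __init__(self, table_size=256):
        self.counters = [1] * table_size   # start "weakly not-taken"

    def predict(self, pc):
        return self.counters[pc % len(self.counters)] >= 2  # taken?

    def update(self, pc, taken):
        i = pc % len(self.counters)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

# A loop branch taken 9 times out of 10: after warm-up, the predictor
# is right on every iteration except the final not-taken one.
bp = TwoBitPredictor()
correct = 0
for trip in range(10):
    for i in range(10):
        taken = i < 9
        if bp.predict(pc=0x40) == taken:
            correct += 1
        bp.update(pc=0x40, taken=taken)
print(correct)   # prints "89" out of 100 branches
```

A predictor like this gets loop-heavy code right almost all the time, which makes the "only a few percent" result all the more surprising, and is why the explanation is worth digging into.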
Labels:
branch prediction,
gpgpu,
microarchitecture,
performance