Project Notes: gpgpu


Friday, June 5, 2015

Not so fast

I wrote a custom engine to render Quake levels with my GPGPU. After fixing many subtle hardware lockups and compiler backend gremlins, I'm right chuffed to see this running reliably on an FPGA. It stresses a lot of functionality: hardware multithreading, heavy floating point math, and complex data structure accesses.

I won't be challenging anyone to a deathmatch any time soon

But... it's running at around 1 frame per second. While this is only a single core running at 50 MHz, the original Quake with a software renderer ran fine on a 75 MHz Pentium, so there's a lot of room for improvement. I'll dig more into the performance in a bit, but first, some background on how this works.
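To put those numbers in perspective, here is a quick back-of-the-envelope calculation of the cycle budget per frame. The function name is made up for illustration; only the 50 MHz clock and the roughly 1 frame per second figure come from the text above.

```python
def cycles_per_frame(clock_hz: float, frames_per_second: float) -> float:
    """Clock cycles available to (or consumed by) each rendered frame."""
    return clock_hz / frames_per_second


# One core at 50 MHz producing about 1 frame per second.
budget = cycles_per_frame(50e6, 1.0)
print(f"{budget:,.0f} cycles per frame")
```

At 1 frame per second, every frame is costing on the order of 50 million cycles on this one core.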

Improved 3D engine profile

I reworked the 3D renderer recently, improving performance and adding features to make it more general purpose. This included improved bin assignment to reduce overhead, tracking state transitions properly, and adding features like clipping.
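For readers unfamiliar with bin assignment, here is a minimal sketch of the general idea in a tile-based renderer: each triangle is added to the list (bin) of every screen tile its bounding box overlaps. This is not the renderer's actual code; the function name, the 64-pixel tile size, and the bounding-box test are all assumptions chosen for illustration.

```python
import math


def bins_for_triangle(verts, tile_size, width, height):
    """Return (tile_x, tile_y) for every tile the triangle's bounding box touches.

    verts is a list of three (x, y) screen-space points. The bounding box is a
    conservative test: it may include tiles the triangle does not really cover.
    """
    xs = [x for x, _ in verts]
    ys = [y for _, y in verts]

    # Convert the pixel-space bounding box to tile coordinates, clamped to the screen.
    min_tx = max(math.floor(min(xs)) // tile_size, 0)
    max_tx = min(math.floor(max(xs)) // tile_size, (width - 1) // tile_size)
    min_ty = max(math.floor(min(ys)) // tile_size, 0)
    max_ty = min(math.floor(max(ys)) // tile_size, (height - 1) // tile_size)

    return [(tx, ty)
            for ty in range(min_ty, max_ty + 1)
            for tx in range(min_tx, max_tx + 1)]


tiles = bins_for_triangle([(10, 10), (70, 10), (10, 70)], 64, 256, 256)
print(tiles)
```

In this example the triangle lands in four bins even though it never reaches the tile at (1, 1). That kind of wasted work is one source of binning overhead, and a tighter overlap test would trade extra setup cost for fewer bin entries.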

Mip-Mapping

I added mipmapping support to the software renderer for my GPGPU project.
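As a reminder of what mipmapping involves, the sketch below picks a mip level from how many texels map onto one screen pixel along an axis. It shows the textbook approach, not necessarily what this renderer does; the function name and the choice to round down to the more detailed level are assumptions.

```python
import math


def mip_level(texels_per_pixel: float, num_levels: int) -> int:
    """Pick a mip level given the texel-to-pixel ratio along one axis.

    Level 0 is the full-resolution texture; each further level halves the
    resolution, so the level grows with log2 of the ratio.
    """
    if texels_per_pixel <= 1.0:
        return 0  # magnified or 1:1, so use the full-resolution texture
    level = int(math.log2(texels_per_pixel))
    return min(level, num_levels - 1)  # clamp to the smallest available level


for ratio in (0.5, 1.0, 4.0, 1000.0):
    print(ratio, mip_level(ratio, num_levels=5))
```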

Write-back vs. write through bake-off

In the last update, I was debating whether the complexity of a write-back cache was justified by the performance increase. I ran a little bake-off between the version 1 microarchitecture, which uses a write-through L1 cache, and the in-progress version 2, which uses a write-back L1 cache. There are many other differences between the architectures that would affect performance, but I'm going to focus on memory access patterns, which should be similar. I used the same teapot benchmark as in previous posts, as it has a nice mix of memory accesses. Both versions are running one core, so there are aspects of cache coherence traffic that aren't captured, but I'll get to that in a minute.
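To illustrate the difference between the two policies, here is a toy model that counts how many writes reach the L2 for a sequence of store addresses. It is only a sketch under simplifying assumptions: a direct-mapped, write-allocate L1, stores only (no loads), and made-up sizes. It models neither microarchitecture.

```python
def count_l2_writes(stores, policy, num_lines=4, line_size=64):
    """Count write transactions sent from the L1 to the L2.

    Write-through sends every store straight on (one word each). Write-back
    only writes a whole line when a dirty line is evicted or finally flushed.
    """
    if policy == "write-through":
        return len(stores)

    tags = [None] * num_lines    # which memory line each cache slot holds
    dirty = [False] * num_lines  # whether that slot has unwritten changes
    writes = 0
    for addr in stores:
        line = addr // line_size
        index = line % num_lines
        if tags[index] != line:
            if dirty[index]:
                writes += 1      # evict the old dirty line to the L2
            tags[index] = line
        dirty[index] = True
    return writes + sum(dirty)   # flush whatever is still dirty at the end


stores = [0, 4, 8, 12, 64, 0]
print(count_l2_writes(stores, "write-through"))
print(count_l2_writes(stores, "write-back"))
```

Note that the two counts are not in the same units: each write-through transaction carries a word, while each write-back transaction carries a full line. Whether write-back wins therefore depends on how often stores hit the same line before it is evicted.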

Messy Details

A problem with books and papers about computer architecture is that they gloss over messy details. The block diagram looks so simple, but, when you begin to implement it, you realize there are fundamental structural hazards. They give no hints about how to handle them. Often, a subtle detail completely alters the design. I'm going to talk about a lot of details around the cache hierarchy, specifically the way the L1 and L2 caches communicate with each other, and a design decision in the current microarchitecture I am implementing.

Faster than light

In the last post, I discussed the fastest possible execution time of a test program, the "speed of light," as it were. There was an important assumption in this calculation: that the CPU issued one instruction per cycle. A common technique to improve performance in modern processors is issuing multiple instructions in the same cycle, also known as superscalar. I hadn't put much thought into a superscalar design given the focus on utilizing the wide vector unit. However, some helpful comments in a previous post have led me to reconsider this decision.
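The "speed of light" bound can be written as a one-line calculation: with a perfect pipeline that never stalls, a program cannot finish in fewer cycles than its instruction count divided by the issue width. The function and the instruction counts below are illustrative, not figures from the earlier post.

```python
def min_cycles(instruction_count: int, issue_width: int = 1) -> int:
    """Lower bound on execution cycles, assuming no stalls of any kind."""
    # Integer ceiling division: a partially filled issue slot still costs a cycle.
    return (instruction_count + issue_width - 1) // issue_width


print(min_cycles(1000, issue_width=1))
print(min_cycles(1000, issue_width=2))
```

Doubling the issue width halves the bound, but only if the program always has two independent instructions ready to go, which is the hard part in practice.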

Keeping score

I recently embarked on a complete redesign of the microarchitecture for the GPGPU I've been working on, with a major goal being to increase the clock frequency. The previous version had a maximum frequency of around 30 MHz when synthesized for my Cyclone IV FPGA board, constrained by some long combinatorial logic paths. One way to increase clock speed is to break stages apart, making the pipeline deeper. This, however, is not without tradeoffs. It increases the latency of each operation (in terms of clock cycles) and introduces new pipeline hazards. Combined, these can decrease the performance of the CPU. I've attempted to mitigate this by utilizing a technique used by a number of modern GPUs (which was in turn borrowed from early out-of-order microprocessors, although in this context it is used for in-order issue).

Waiting in line...

I updated the simple 3D engine running on the GPGPU I've been hacking on to use a more sophisticated runtime. The original incarnation of the engine used a single strand to perform all of the work serially. That won't scale up to multiple cores, and it performs suboptimally, as utilizing hardware multithreading is essential to get good performance on this architecture.
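One common way to spread rendering work across threads is a shared job queue that idle workers pull from. The sketch below shows that pattern with ordinary Python threads. It is an assumption about the general shape of such a runtime, not a description of the one used here, and `run_jobs` is a hypothetical name.

```python
import queue
import threading


def run_jobs(jobs, num_threads=4):
    """Run zero-argument callables on worker threads and collect their results."""
    work = queue.Queue()
    for job in jobs:
        work.put(job)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job = work.get_nowait()  # grab the next job in line
            except queue.Empty:
                return                   # nothing left, so this worker exits
            value = job()
            with lock:                   # protect the shared result list
                results.append(value)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


squares = run_jobs([lambda i=i: i * i for i in range(8)])
print(sorted(squares))
```

Results arrive in whatever order the workers finish, which is why the example sorts them before printing.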

Branch Prediction

I've been designing a simple GPGPU in my spare time. I recently implemented branch prediction, but when I ran the small suite of benchmarks I had written for it, I found that it only improved performance by a few percent. This may seem a bit puzzling at first blush because the benefits of branch prediction are well known. Was there a bug in my implementation? As it turns out, the answer is no, but the reason why is interesting.

Project Notes

Friday, June 5, 2015

Not so fast

Saturday, February 21, 2015

Improved 3D engine profile

Sunday, December 21, 2014

Mip-Mapping

Saturday, July 5, 2014

Write-back vs. write through bake-off

Friday, July 4, 2014

Messy Details

Friday, June 6, 2014

Faster than light

Tuesday, May 27, 2014

Keeping score

Friday, November 30, 2012

Waiting in line...

Sunday, November 11, 2012

Branch Prediction