Project Notes: profiling

Wednesday, November 30, 2016

Measure Twice, Cut Once

In a previous post, I observed that, as I increased the number of cores in my GPGPU, performance began to plateau and hardware threads spent more time stalled because their store queue was full. I speculated that the latter might cause the former, although that wasn't definitive. The current implementation has only a single store queue entry for each thread. One optimization I've been considering is adding more store queue entries, but this has subtle and complex design tradeoffs.
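To get a feel for why the store queue depth might matter, here is a toy model, not the actual hardware. It assumes a thread issues one instruction per cycle, that pending stores drain to memory one at a time, and that each takes a fixed `drain_latency` cycles. The function name and all parameters are hypothetical.

```python
from collections import deque


def count_store_stalls(instructions, queue_depth, drain_latency):
    """Count cycles a single thread stalls because its store queue is full.

    instructions: sequence of "store" or any other string (non-store work).
    Assumption: stores drain serially, each taking drain_latency cycles.
    """
    pending = deque()  # completion cycle of each store still in the queue
    cycle = 0
    stalls = 0
    drain_free_at = 0  # cycle at which the memory side can start the next store
    for op in instructions:
        # Retire any stores that have finished draining by now.
        while pending and pending[0] <= cycle:
            pending.popleft()
        if op == "store":
            if len(pending) >= queue_depth:
                # Queue full: wait until the oldest store completes.
                stalls += pending[0] - cycle
                cycle = pending[0]
                pending.popleft()
            start = max(cycle, drain_free_at)
            drain_free_at = start + drain_latency
            pending.append(drain_free_at)
        cycle += 1
    return stalls


burst = ["store"] * 4
for depth in (1, 2, 4):
    print(depth, count_store_stalls(burst, depth, drain_latency=4))
```

In this model a deeper queue absorbs short bursts of stores, but it cannot help a thread that keeps storing faster than memory drains, which hints at why the tradeoff is not obvious.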

Not so fast

I wrote a custom engine to render Quake levels with my GPGPU. After fixing many subtle hardware lockups and compiler backend gremlins, I'm right chuffed to see this running reliably on an FPGA. It stresses a lot of functionality: hardware multithreading, heavy floating point math, and complex data structure accesses.

I won't be challenging anyone to a deathmatch any time soon

But... it's running at around 1 frame per second. While this is only a single core running at 50 MHz, the original Quake with a software renderer ran fine on a 75 MHz Pentium, so there's a lot of room for improvement. I'll dig more into the performance in a bit, but first, some background on how this works.

Automatic Cache Prefetching

Many optimizations take the form of "lazy evaluation." When an operation may end up not being needed, we defer it as long as possible for the chance of avoiding it entirely. However, there is another class of optimizations which attempt to do work speculatively before we know that we need to. At first blush, it seems like there are a few conditions where it might be advantageous to do this:

1. If the operation has a lot of latency.
2. If the resource that is needed for the operation tends to be underutilized.
3. If the operation ends up being needed most of the time anyway.

I've experimented with automatically prefetching cache lines to improve performance in my GPGPU design. This is a common technique in many computer architectures. When a cache miss occurs, the hardware automatically reads a few of the following cache lines as well.
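As a rough illustration of the idea, the sketch below models a next-line prefetcher in software. It is a simplification under stated assumptions: the cache is unbounded (nothing is ever evicted), the 64-byte line size is assumed, and `simulate_prefetch` and its parameters are hypothetical names, not part of the hardware design.

```python
def simulate_prefetch(addresses, line_size=64, prefetch_count=0):
    """Return (demand_misses, lines_fetched) for a sequence of byte addresses.

    On each demand miss, the missing line plus the next prefetch_count
    lines are fetched. Assumption: the cache never evicts anything.
    """
    cached = set()
    misses = 0
    lines_fetched = 0
    for addr in addresses:
        line = addr // line_size
        if line in cached:
            continue
        misses += 1
        # Fetch the missing line and the lines that follow it.
        for candidate in range(line, line + prefetch_count + 1):
            if candidate not in cached:
                cached.add(candidate)
                lines_fetched += 1
    return misses, lines_fetched


sequential = range(0, 8 * 64, 4)              # walk 8 consecutive lines
strided = [i * 4 * 64 for i in range(8)]      # touch every 4th line

print(simulate_prefetch(sequential, prefetch_count=0))
print(simulate_prefetch(sequential, prefetch_count=3))
print(simulate_prefetch(strided, prefetch_count=3))
```

With sequential accesses the prefetched lines all get used, so misses drop without extra traffic. With the strided pattern none of them are used, so the miss count stays the same while four times as many lines are fetched. That is the third condition in the list above at work.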

Improved 3D engine profile

I reworked the 3D renderer recently, improving performance and adding features to make it more general purpose. This included improved bin assignment to reduce overhead, tracking state transitions properly, and adding features like clipping.

Write-back vs. write-through bake-off

In the last update, I was debating whether the complexity of a write-back cache was justified by the performance increase. I ran a little bake-off between the version 1 microarchitecture, which uses a write-through L1 cache, and the in-progress version 2, which uses a write-back L1 cache. There are many other differences between the architectures which would affect performance, but I'm going to focus on memory access patterns, which should be similar. I used the same teapot benchmark as in previous posts, as it has a nice mix of memory accesses. Both versions are running one core, so there are aspects of cache coherence traffic that aren't captured, but I'll get to that in a minute.
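The difference between the two policies can be sketched by counting how many write transactions leave the L1 cache for a given stream of stores. This is a toy model, not either microarchitecture: it assumes a direct-mapped, write-allocate cache, ignores the line fills a write-back cache performs on store misses, and uses hypothetical names and sizes throughout. Note also that a write-through transaction carries a single store while a write-back transaction carries a whole line, so the counts are not directly comparable as bytes.

```python
def count_memory_writes(store_addresses, policy, line_size=64, num_lines=4):
    """Count write transactions sent from the L1 cache to the next level."""
    if policy == "write-through":
        # Every store is forwarded immediately.
        return len(store_addresses)
    if policy != "write-back":
        raise ValueError("unknown policy: " + policy)

    tags = [None] * num_lines     # which line each cache slot holds
    dirty = [False] * num_lines   # whether that line has unwritten data
    writes = 0
    for addr in store_addresses:
        line = addr // line_size
        index = line % num_lines
        if tags[index] != line:
            if dirty[index]:
                writes += 1       # evicting a dirty line writes it back
            tags[index] = line
        dirty[index] = True
    # Flush whatever is still dirty at the end of the run.
    writes += sum(dirty)
    return writes


same_line = [0, 4, 8, 12] * 4                  # 16 stores to one line
conflicting = [i * 64 * 4 for i in range(8)]   # each store evicts the last

for stores in (same_line, conflicting):
    print(count_memory_writes(stores, "write-through"),
          count_memory_writes(stores, "write-back"))
```

Repeated stores to the same line collapse into a single write-back, while stores that keep evicting each other gain nothing, which is why the result depends so heavily on the access pattern of the benchmark.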

Waiting in line...

I updated the simple 3D engine running on the GPGPU I've been hacking on to use a more sophisticated runtime. The original incarnation of the engine used a single strand to perform all of the work serially. That won't scale up to multiple cores, and it performs suboptimally because utilizing hardware multithreading is essential to get good performance on this architecture.
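This excerpt doesn't describe how the new runtime is structured. One common way to spread work like this across threads is a shared job queue that workers pull from until it is empty. The sketch below shows that pattern with host threads standing in for hardware threads. It is an assumption for illustration, and `run_jobs` is a hypothetical name, not the engine's API.

```python
import queue
import threading


def run_jobs(jobs, num_threads):
    """Run each callable in jobs on a pool of threads; return results in job order."""
    pending = queue.Queue()
    for index, job in enumerate(jobs):
        pending.put((index, job))
    results = [None] * len(jobs)

    def worker():
        # All jobs are queued before any worker starts, so an empty
        # queue means there is no more work to pick up.
        while True:
            try:
                index, job = pending.get_nowait()
            except queue.Empty:
                return
            results[index] = job()  # each slot is written by exactly one thread

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Stand-in for per-tile rendering work.
tile_jobs = [lambda i=i: i * i for i in range(16)]
print(run_jobs(tile_jobs, num_threads=4))
```

Because idle workers pull the next job themselves, a thread that finishes early simply picks up more work instead of waiting on the slowest one.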

Project Notes

Wednesday, November 30, 2016

Measure Twice, Cut Once

Friday, June 5, 2015

Not so fast

Saturday, February 28, 2015

Automatic Cache Prefetching

Saturday, February 21, 2015

Improved 3D engine profile

Saturday, July 5, 2014

Write-back vs. write-through bake-off

Friday, November 30, 2012

Waiting in line...