Saturday, February 21, 2015

Improved 3D engine profile

I reworked the 3D renderer recently, improving performance and adding features to make it more general purpose.  This included improved bin assignment to reduce overhead, tracking state transitions properly, and adding features like clipping.

Software Renderer Design

Like previous revisions of this 3D engine, this one uses tiled rendering. The screen is broken into 64x64 pixel squares. These are assigned dynamically to hardware threads, which render each tile completely before moving to the next one. There are multiple tiles in progress at a time. This video shows this in action in the emulator with four threads.

Tiled rendering has a few advantages:
  • External memory bandwidth: The framebuffer is larger than the cache.  If triangles were rendered in the order they were submitted, the cache would most likely thrash. In a tiled renderer, the active set of tiles fits entirely in the cache, so each byte in the framebuffer is ideally written to external memory only once.
  • Scaling: In order to fully utilize cores/threads, the workload needs to be divided into smaller tasks, and tiles are a convenient unit of work. The advantage of this approach is that there is no need for synchronization between threads while a tile is being rendered, since the tiles are non-overlapping. If the work were divided some other way (for example, having each thread process a triangle), synchronization would be required to ensure overlapping pixels were drawn in the correct order. My earlier attempts at building a non-tiled engine showed that this synchronization overhead can be significant.
Each frame is rendered in two phases. The first performs vertex and triangle setup, and the second fills the tiles.  Since every triangle that overlaps a tile is rendered before moving to the next one, the geometry phase must be complete before the pixel phase begins.

Geometry phase

  1. Fetch vertex attributes and run the vertex shader on them. Vertices are broken into groups of 16, which are dispatched dynamically to hardware threads. Each thread calls a vertex shader callback, which is vectorized and works on all 16 vertices at once.  This means there are up to 64 vertices in-flight per core (four threads times 16 vertices).
  2. Triangle setup.  This is also divided between threads on a first-come, first-served basis. It is not vectorized because it deals with a lot of data structures.
    • Near plane clipping - Ensure there are no negative Z values, which would mess up perspective calculations. This operation may split triangles into multiple smaller ones.
    • Backface culling 
    • Perspective division
    • Conversion from screen space to raster coordinates
    • Binning. An axis-aligned bounding box test is used to determine which tiles a triangle potentially overlaps. Each tile has a list of triangles (actually a linked list of arrays), which is used by the next phase.  Lock-free algorithms are used to quickly allocate data structures and insert them into the tile lists.

Pixel phase

Each thread grabs the next unrendered tile and renders all triangles in its list. It performs the following steps:
  1. Sort triangles.  Since triangles are processed in parallel, multiple threads can insert triangles into the same tile list.  This means they may not be in the order that they were originally submitted.  The thread first performs an insertion sort to address this. Note that the tile list is mostly in order, so an insertion sort is generally more efficient in this situation than many other algorithms.
  2. Clear tile to background color
  3. Rasterize triangles.  This algorithm uses half-plane equations and recursively divides the tile into 4x4 grids, taking advantage of the vector unit to update the equations for all points simultaneously.  When it gets to the smallest level (4x4 pixels), it performs the following steps with each pixel represented in one lane of the vector (up to 16 at a time):
    • Depth buffer/early reject
    • Parameter interpolation - The parameters that were computed per-vertex by the vertex shader are interpolated for all pixels in-between.  This is done in a perspective correct manner.
    • Pixel shading - As with the vertex shader, this is configurable via function pointer and can be set per draw call.
    • Blend/writeback
  4. Flush tile back to system memory - loop over all overlapped cache lines and invoke flush instruction on each address.


I've generated profiles with two workloads with different performance characteristics.  The first is the teapot that I've used for many previous experiments.  It has ~2300 triangles and Lambertian shading.  The second example is a cube with bilinear filtered texture mapping. The texture is 512x512 pixels and fits in a megabyte of memory (4 bytes per pixel).  This is large for an individual texture, but simulates the common case of the texture working set for a frame being larger than the cache.  The L2 cache is configured as 128k, and the L1 instruction and data caches are each 16k.


Basic Metrics

Here are the values of internal performance counters:

                          teapot     texture
  Total cycles          13226183     6809062
  Store rollback count    343268      163497
  Store count            1204554      436648

  • The percentage of cycles for which an instruction is issued for the teapot test is about 68%.  The texture test is around 59%, ostensibly because it is spending more time waiting on texture fetches from memory.  There is an opportunity to increase performance by improving issue utilization, perhaps with more threads or other microarchitectural improvements.
  • About 11% of issued instructions for the teapot test are squashed, either because of a cache miss or branch. The figure is 27% for the texture test.
  • The L2 cache miss rate is around 6% for the teapot test and 27% for the texture test.  The latter is not surprising, because the texture is large.

Breakdown by Stage

In order to analyze the execution of these benchmarks, I used a very sophisticated profiler that is built into the compiler: I commented out chunks of code and measured the difference in execution time. :) Since each stage in the rendering pipeline depends on the results from the last, I worked backwards, tabulating total clock cycles in a spreadsheet:

A few observations:
  • In both cases, vertex shading (which includes fetching the vertex attributes from memory) appears to be negligible compared to other parts of the rendering process. Fixed function GPUs have dedicated hardware to fetch vertices.
  • Blend and writeback is a relatively small part of the overall load, which was a bit of a surprise to me. The color components computed by the shader are stored as a floating point value per channel, and it takes a series of multiplies, conversions, clamps, shifts, and masks to pack them into pixels.  I had originally considered adding specialized instructions to perform these operations more efficiently.  This step also flushes the data from the cache by looping over the tiles and issuing a flush instruction for each cache line, which seemed inefficient, and I had considered finding a way to accelerate that in hardware as well.  While it's possible that the measured overhead in this use case is lower because the entire screen isn't covered in these tests, it doesn't seem like there's a win doing much optimization here now.
  • Parameter setup and interpolation (dark red and green bars) are fairly expensive in both cases. I think these operations require a floating point pipeline, so it's not clear to me that a dedicated interpolator would have a significant area advantage over using the cores to compute them.
  • Texture fetch is very expensive, and some kind of hardware acceleration seems appropriate (the Larrabee paper stated this as well).


The intent of this architecture is to be able to get good performance by throwing a bunch of cores at the problem. I ran a render test with varying numbers of cores to determine how performance scales as cores are added.

The number of cores will also increase the L2 working set size. I didn't want this to be a factor, but rather to focus on other microarchitectural bottlenecks, so I configured a 4MB L2 cache and 32k L1 data and instruction caches for this test. This is configured with four hardware threads per core.

The test program I used was the 'clip' test. I used it because it covers the whole screen, and thus should scale better than the tests in the previous section, which render to a small number of tiles in the middle of the screen.  The checkerboard is computed programmatically in the pixel shader and does not use a texture. The geometry for this is the inside of a cube, so the vertex calculations have low overhead.

This chart shows the relative speedup as cores are added.  The orange line shows a linear speedup for reference. 
There are most likely two reasons why performance saturates. First, there is more lock overhead.  This primarily occurs in two places:
  • Where lock-free atomic operations are used to access shared data structures.  When the 'store synchronized' instruction that is the fundamental primitive for these operations detects another thread having updated the variable, it must loop and retry.
  • Where the threads spin at the end of a phase to wait for others to finish. Spinning in general is very bad, because it steals issue cycles from other threads.
Running the same program in the emulator with 4 and 32 threads, the latter issues 28% more instructions overall.  These are presumably synchronization instructions. Traditional GPUs avoid much of this overhead with a fixed-function scheduler.

The second area to look at is the interconnect.  The L2 cache is shared by all cores and is a potential bottleneck.  This graph shows what percentage of cycles new requests are queued into the L2 pipeline.  The interface is not saturated, and interestingly falls off at 8 cores, presumably because some other bottleneck has begun to degrade performance.
This graph shows the average amount of time hardware threads spend in various states.  As expected, the amount of time waiting because the store buffer (orange) is full increases as the number of cores increases, a consequence of the write-through design.
Because this is not doing texturing and the caches are fairly large, the dcache miss time is negligible.


This analysis didn't turn up any smoking gun bottlenecks, but plenty of areas for further investigation.  Some of the areas that intuitively seemed like they would be expensive, such as pixel writeback, turned out to be relatively minor. Other areas, like texture sampling, were obvious areas for hardware improvements, and the data confirmed this.


  1. I think the execution part of the cpu-gpu is very inefficient.
    About half of the time spent (as shown in the last graph) is due to RAW dependencies and writeback conflicts.
    Imagine a cpu able to execute 1 instruction per clock. Every instruction sent to an execution unit with a different latency will cause a 1 cycle stall.
    So for a small reduction in die size, I think less than 21%, there is a 21% performance penalty.
    Also, as I understand it, operand fetch takes 2 clock cycles to complete, so for a simple integer addition the latency is 4 cycles, far larger than most cpus, which have a latency of 1. (Even the Pentium 4, with about 30 pipeline stages, has a latency of 1.) FPU add usually has a latency of 3-4; Nyuzi has 7-8.
    Finally, Nyuzi uses multi-threading and not simultaneous multi-threading. Cpus with SMT can execute instructions from different threads in the same clock cycle.

    1. I don't think inefficient is quite the right word. It is a deliberate design decision to increase the length of the pipeline to improve clock speed. Most GPUs philosophically emphasize throughput over latency. The figure of merit is utilization (the fraction of cycles in which the processor issues an instruction). Despite the fact that individual threads spend a fair amount of time blocked, there is usually one that is ready. This is the whole point of hardware threads.

      The previous iteration of this design had a shorter pipeline and a forwarding network to reduce latency. But its maximum clock speed was also half that of the new one. The new design is significantly faster overall on FPGA.

    2. I understand that it is optimized for throughput. When a RAW dependency occurs the cpu can execute from a different thread, but when a writeback conflict occurs, this prevents the cpu from executing from another thread.

      Also, when I say the latency should be lower, I do not necessarily mean it must be 1 cycle. Even 3 cycles for integer instructions, by having 1 cycle latency for operand load, is a 25% improvement, which in my opinion is big.

    3. That's not correct about writeback conflicts. If another thread has an instruction with a different latency, it can still issue.

      I actually added the second fetch stage later in the implementation, and the impact was negligible in terms of cycles. Consider the simplest case: if you have four threads and each instruction has four cycles of latency, then you can issue one instruction every cycle. The pipeline will be 100% utilized. At that point, you cannot run any faster and latency is irrelevant.

    4. For the writeback stage I don't mean it can't execute from another thread, but that it is more difficult to find another instruction to execute.

      For the second point, the case you mention is just the BEST possible (for RAW dependencies). The other threads may be waiting on a cache miss or for an FPU result, which has 8 cycles of latency.

      I think you put a lot of pressure on multi-threading: no branch prediction, long latency, writeback conflicts. I think it can't do miracles.

    5. Fortunately, threads don't need to perform miracles, they just need to hide latency, which they are pretty good at. :)

      You bring up a few interesting points:
      - The case I described above isn't the best case, since it assumes every instruction is dependent on the previous one. This usually isn't the case, which is the reason for the scoreboard:

      - I did experiment with branch prediction in a previous version of the microarchitecture. It didn't help much:

      - I think you are approaching latency with a very CPU-centric view. Check out this cool paper: especially section IV.B and table III. The latency of an add instruction on the GPU is 24 cycles.