Sunday, December 1, 2013

Evaluating Instruction Set Tradeoffs Quantitatively

While a great deal of research is devoted to CPU performance at the microarchitectural level, I haven't found much rigorous analysis of the performance tradeoffs inherent in the design of instruction sets themselves. And while it's instructive to examine existing processor instruction sets, without knowing what alternative options were considered, what the tradeoffs were, and ultimately why decisions were made, a lot of important information is missing. For example, one key insight in early research into RISC performance came from analysis of instruction frequencies in common CISC programs: the more complex instructions were used substantially less frequently than simpler ones.

I've decided to explore some of the decisions I made in my own processor design (https://github.com/jbush001/GPGPU) empirically. This processor is targeted by a C++ compiler based on the LLVM toolchain. The beauty of LLVM is that backend code generation is heavily data driven, based on templated patterns specified in configuration files processed by a tool called "TableGen." This makes it relatively easy to make substantial changes to code generation quickly. My experimental methodology is to disable certain instruction set features and measure the resulting performance of a compiled program.



The test program I used is a 3D renderer implemented in C++. It renders a single frame containing the standard Utah Teapot model with around 2300 triangles into a 512x512 framebuffer using a Phong shading model. It utilizes both hardware multithreading and the wide vector unit. The core engine is around 3000 lines of source code and compiles to 16k of code and 59k of read-only data (most of which is geometry data for the model).

All tests measure the total number of instructions issued while running in the functional simulator. While a cycle-accurate simulator of this design is available, its results would arguably be heavily microarchitecture dependent, and my goal is to evaluate the instruction set. Since instructions have different costs, this is an imperfect measurement. However, it gives a rough idea of the impact of each feature.

I've chosen three fundamental features of the instruction set to evaluate.

Feature Analysis


Arithmetic with Immediate Operands


An "immediate" instruction form allows encoding one constant operand directly in the instruction rather than requiring it to first be loaded into a register. For example:

    add_i s0, s1, 12

This is fairly common in many architectures, and is useful in a number of common scenarios, such as incrementing or decrementing loop counters or pointers, encoding indices for bit arithmetic, and checking loop counts. In this architecture, the immediate field has a maximum size of 15 bits and a range of -32768 to 32767. Values larger than this must be transferred into a register using a memory load instruction from the read-only section of the program.

One of the big tradeoffs in this architecture is the number of instruction bits each feature takes to encode, and this feature is no exception.

For the first experiment, I disabled arithmetic instructions that use this format. Arithmetic with a constant operand now requires an additional instruction to load the value into a register (although the compiler is generally smart enough to retain values across multiple uses, so that cost is amortized over multiple iterations of a loop, for example).
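To make that concrete, here is roughly what the two cases look like. This is a hedged sketch in the same assembly syntax as the example above; the load mnemonic, label, and comment delimiter are illustrative rather than actual compiler output:

    # With immediate arithmetic enabled, one instruction suffices
    add_i s0, s1, 12

    # With it disabled, the constant must first be loaded from the
    # read-only data section, then used from a register
    load_32 s2, const_12
    add_i s0, s1, s2

As noted above, the compiler can often hoist the load out of a loop, so the measured cost is less than one extra instruction per arithmetic operation.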

Eliminating immediate arithmetic instructions reduced performance by around 4.2%.  This actually turned out to be much smaller than I expected, but is nonetheless arguably large enough to be important.

Baseline: 19,982,553 instructions
No immediate: 20,869,264 instructions
(Lower is better)

One question is whether this instruction format could be replaced by a set of simpler instructions, for example increment and decrement. I instrumented the compiler to dump all of the constants emitted during compilation of the test program. There were 124 unique constants used, which suggests there is value in being able to encode arbitrary constants.

Mixing Vector and Scalar Operands


One of the features of this instruction set that differs from more conventional vector architectures--at least the ones that I have studied--is the ability to mix vector and scalar operands. For example:

    add_i v0, v2, s1

In this case, the value in scalar register s1 is duplicated across all lanes. There are a number of places where this is useful: for example, when loop values are shared among parallel instances of an execution kernel, or for 'uniform' variables like information about directional lights. The tradeoff of this feature, as before, is that several additional instruction bits are needed to encode the types of the instruction operands.

Static analysis shows that this feature isn't used much across the program as a whole, but appears often within the core routines that execute most frequently. Disabling it requires one extra instruction per use to copy the value from a scalar register to a vector register first (a sketch of this appears below). Running an experiment with this disabled suggests the feature offers only a 2.1% performance increase.

Baseline: 19,982,553 instructions
No mixing: 20,405,634 instructions
(Lower is better)

Register transfer instructions are relatively cheap (compared to memory operations, for example), so I would expect the impact in a real system to be even smaller.
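For reference, here is the workaround sketched in the same assembly style; the copy mnemonic and the comment delimiter are assumptions on my part rather than actual compiler output:

    # With mixed operands, s1 is implicitly replicated across all vector lanes
    add_i v0, v2, s1

    # Without mixing, the scalar must first be copied into a vector register
    # (the broadcast mnemonic is illustrative)
    move v1, s1
    add_i v0, v2, v1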

Register File Size


There is a fair amount of research on how the size of the register file correlates with performance. Although register file size would seem to be a microarchitectural tradeoff (area and energy), it has implications for the instruction set, specifically the instruction encoding. This design features 32 scalar registers and 32 vector registers, so each register must be encoded with a 5-bit index. Since most operations must encode a destination register, two source registers, and, in the case of vector operations, a mask register, 20 bits are consumed by operands alone, leaving only 12 bits to encode the instruction format and operation.

The first experiment I ran varied the number of scalar registers. The difference between the smallest register file in this set (8 registers) and the largest (28 registers) is around 20%, which is significant. The performance curve is relatively smooth: each additional register gives an incremental improvement. The curve doesn't appear to flatten much near the maximum number of registers, which suggests that continuing to add registers would yield further improvements. Unfortunately, increasing the number of available registers beyond the maximum shown here would require a significant change to the instruction encoding.

Performing the same experiment with vector registers gives a much smaller delta. Between 8 and 32 registers, performance only differs by about 0.3%. Furthermore, above 16 vector registers performance is unchanged, as the working values used by this test program apparently fit in that many registers. However, this may be specific to the test program I used; I could contrive classes of programs that are more heavily dependent on the number of vector registers. It would be interesting to run this test on different types of workloads.

Conclusions


Some of the results were not what I expected. Some fundamental features of the architecture had relatively small overall impacts on performance, although one could argue that the combination of these features may add up to something significant. One thing I didn't look at much is how these features interact with each other (for example, does the combination of vector/scalar mixing and immediate operands contribute the sum of their individual speedups?). Running these tests with different workloads may also yield somewhat different results.

18 comments:

  1. We went through similar processes on Larrabee. Of course we had to deal with fixed constraints of the x64 instruction set (number of GPRs, etc), but because we were designing a new wide-SIMD ISA, we had a lot of flexibility in some areas. The very first thing I wrote was a compiler - a fairly simple one that took DirectX pixel shaders and compiled them to the LRB ISA (though it wasn't called LRB at the time of course!). This let us iterate through a bunch of decisions fairly quickly.

    Things we decided using this methodology:

    * Width of the VPU. 32 lanes of float32 was basically unbuildable. 16 and 8 were fairly close in performance, but the fact that 16 was 4x4 gave us some nice features such as simplifying shuffles, and also that a vector register was the same size as a cache line (64 bytes). In hindsight, the non-vector part of the core grew larger than we initially expected, so having only 8 lanes would not have saved much area or power per core, but would have cost a lot of performance.

    * Number of registers. 32 gave significant scheduling benefits over 16, and 64 gave a few more, but was too hard to build/encode.

    * Vector load-op instructions. This was really important. It saved a lot of extra load instructions, and it also saved a lot of register file bandwidth. A load-op doesn't need to store its intermediate in a register file, nor read it back out, and it doesn't pay the RAW-hazard latency - it's just all pipelined. You may wish to think about adding this at some point.

    One thing we had a massive argument over, and couldn't really get enough data for or against was having separate mask registers. Many of us really wanted to use the GPRs for masks, as you did, but the pipelining and floorplanning really didn't allow it. Partly this was because of the P54 legacy - the scalar unit retired several pipeline stages before the vector unit even read its sources, and dealing with the latency of vector compares (which would have produced GPR results) was considered too difficult. In the end, we had to change the P54 unit so much that it would have supported this, and in retrospect we may have made the wrong decision and needlessly complicated the ISA.

    I notice you have three bits of encoding space used to choose between three types of vector instruction - non-masked, masked, and inverse masked (also scalar). We very rarely needed the inverse of a mask in real shaders, and when we did we usually needed the inverse relative to an outer mask. So we'd already have a mask like 00011111 from an outer clause, then we'd do an if/else/endif clause and need the masks 00011100 and 00000011 for each side of that (we added a specific "not-and" logic instruction to generate it: res=A&~B). Having the mask 11100011 for free really wouldn't have been a useful thing.

    We also didn't want to burn an encoding bit for "no mask", so we burned a register number instead (an old MIPS trick). In our case it was mask register zero. But you already have a "special" GPR that nobody in their right mind would use as a mask - s31.

    Take these two things together, and you could reclaim those three encoding bits, or at least reduce them to one (scalar/vector). Just have all vector instructions take a mask, but if the mask specified is 31, that means "not masked". Saving encoding bits is absolutely worth it.

    Replies
    1. Wow, thanks for the feedback!

      I always wondered about the separate mask registers on Larrabee, that's interesting.

      The inverted mask thing seemed like a good idea before I had done much experimentation with it, but I agree, it didn't turn out to be very useful. I'll probably rip that out.

      I had considered using a special register as an all-ones mask, but when there is no mask in use, I reuse those instruction bits as part of the immediate value. I ran into some use cases where the tiny immediate field (8 bits) seemed problematic, like computing stack frame offsets. However, after thinking about it a bit, I suspect the overhead of having to execute an extra instruction or two in those cases is probably negligible because it happens infrequently. I could always add a special instruction to compute an effective address if needed.

      To your last point, getting everything to fit in the 32 bit instruction has been one of the big challenges for me. The instruction space is pretty crowded.

    2. We played with encodings to an almost absurd amount. LRB1 (KNF), LRB2 (KNC) and now AVX512 (KNL) each have totally different and incompatible encodings for almost identical ISAs. And each one took many engineer-years of discussion and bit-shuffling. I still think the first one was the best :-)

      One thing you might consider is going limited superscalar (or possibly even VLIW) and co-issuing vector and scalar instructions. Having scalar instructions be "for free" on LRB was incredibly useful, and it was common for code to have a third of its instructions be scalars. This allowed us to keep the expensive vector unit busy all the time, while the "life support" scalar instructions became free.

      We also allowed vector stores to pair with vector ALU as long as there was a free vector RF read port - we had three read ports to support multiply-add, so a vector store could pair with anything that wasn't a multiply-add without a memory source. That was handy too - combined with load-op, it meant that most memory loads and stores could happen in parallel with arithmetic.

      All this would complicate your design a lot of course.

    3. Interesting idea. The vector unit is underutilized in many of the tests I've run. Even in heavily vectorized code, there are scalar operations that are common to all parallel instances. Being able to co-issue those seems like an opportunity to squeeze out a lot more performance. At first blush, implementing a simple VLIW doesn't seem too onerously complex. There would be a new chunk of logic in the fetch stage to align variable length instruction packets. The rest of the pipeline wouldn't need to be much more sophisticated: the compiler could do the heavy lifting of scheduling instructions appropriately to avoid conflicts. LLVM has a fairly flexible VLIW packetizer.

      The shared vector/scalar pipeline in the current architecture wouldn't map very well to this, since instructions can mix scalar and vector operands. It would probably make sense to remove that ability and simplify things, since it doesn't win that much performance anyway.

    4. Yes, the compiler can do a lot of the hard work. LRB1 had a "just too late" pairing-checking system that let you issue instructions in any combo (x86 legacy requirement!), but would stall the machine pretty badly if you got the wrong pairing (vector+vector, used too many read ports, dependency between the instructions, etc). It preserved program integrity, but it wasn't fast. However the compiler could almost always schedule around the problems, but if not it would occasionally insert NOPs to avoid the stalls. It actually worked out fine in the end and kept the logic simple.

      We never used scalar floating-point math. Although in theory LRB1 had the legacy x87 unit for backward compatibility, in practice we ripped a bunch of scheduling logic out which made it much slower than just doing it in one lane of the vector pipe. Nobody missed it. So you probably only need to support integer arithmetic in the co-issue pipe (lower latency!)

    5. Okay, so I could have normally issued scalar instructions go down the lowest vector lane and have a separate, integer-only scalar co-issue pipeline that is used opportunistically for the paired instruction when it meets the right criteria (there are no register port conflicts, scalar integer operation, etc). I could probably modify the functional simulator to detect this condition to determine how often it would take advantage of this.

      This is where my software background sometimes gets in my way: I kept thinking of how to structure the scalar pipeline to handle every case. Adding a little redundant hardware can often improve performance.

    6. I ran some experiments here: http://latchup.blogspot.com/2014/06/faster-than-light.html

  2. This comment has been removed by the author.

  3. I think the instruction encoding is complex. I think a more MIPS-like encoding would be better.
    Example:
    7 bits 5 bits 5 bits 5 bits 5 bits 5 bits
    | op code | src1 | dest | mask | function | src2 |

    arithmetic immediate
    7 bits 5 bits 5 bits 15 bits
    | op code | src1 | dest | immediate |

    vector immediate
    7 bits 5 bits 5 bits 5 bits 10 bits
    | op code | src1 | dest | mask | immediate |

    jump unconditional
    7 bits 25 bits
    | op code | offset |

    jump conditional
    7 bits 5 bits 20 bits
    | op code | src1 | offset |

    load - store
    7 bits 5 bits 5 bits 15 bits
    | op code | src1 | dest | offset |

    load - store mask
    7 bits 5 bits 5 bits 5 bits 10 bits
    | op code | src1 | dest | mask | offset |

    cache control
    7 bits 5 bits 20 bits
    | op code | src1 | offset |

    All instructions with an immediate or offset encode the operation entirely in the op code field.
    The first type uses only some values of the op code field, with the rest of the operation encoded in the function field.
    For vector instructions, mask register 31 means no mask. I think 10 bits are enough because immediates tend to be small, and I do not think they are used regularly in vector instructions. Also, I think vector-scalar instructions with immediates are rarely used.
    Because branch instructions point to aligned positions, the last 2 bits can be shifted out, increasing branch range.
    The old encoding had immediate bits in different positions, and maybe this made instruction decoding more difficult. RISC-V made immediate encoding more complex to keep the bit positions of immediates constant (RISC-V ISA spec: http://riscv.org/download.html#tab_isaspec). Also, non-temporal memory instructions may help the cache; Larrabee uses them. Why did you use bits to indicate instruction types? I think they take much space and make extensions more difficult.

    Replies
    1. The instruction encoding has evolved over time as I added and removed functionality, so it has accumulated some cruft. One task on my list is to re-encode it, mostly to allow space for adding more floating point operations in the future.

      Having the initial bits encode the format is basically the same as partitioning the opcode space on power-of-two boundaries (e.g. if I say bit 0 being 0 indicates immediate arithmetic, that would be the same as opcodes 0-127). My initial motivation for that was to make decoder logic simpler: I can look at a few initial bits and use those to steer muxes that select the appropriate instruction fields. It also is easier to manage in the LLVM backend, which uses multiclasses to generate all possible addressing modes of each operation type.

      Making the immediate be in different positions is trivial in hardware: it's just a multiplexer in the operand fetch (register) stage.

      The thing I didn't like about the MIPS 'function' field was that I needed to have a translation table for immediate operations to convert opcodes to ALU operations (since there are a bunch of operations that have both register-register and immediate forms).

      My current feeling is that instruction encodings are primarily aesthetic. The decode stage is a relatively tiny amount of the overall area of the design, and using multiplexers to extract instruction fields is trivial in hardware. That said, the thing that I like about what you've proposed is that it opens up the instruction space, which is a bit tight.

    2. My last point wasn't meant to imply that simplicity isn't valuable, though. Keeping things simple is an important and sometimes underrated design goal. It makes things easier to debug and validate.

      My current decode stage ended up with a lot of random logic. It's probably a lot more complex than it needs to be. The scheme you've proposed could theoretically be done solely with a 128 entry ROM table that generates the control signals. I'm not sure about the area tradeoffs of that vs. random logic.

  4. I think compare greater-than and greater-than-or-equal instructions are not necessary.
    For example:
    cmpgt_i s1,s2,5
    btrue s1,somewhere

    can be transformed to
    cmple_i s1,s2,5
    bfalse s1,somewhere

  5. EDIT: They may help for mask instructions.
    I think scalar-scalar instructions should use different names than vector-vector and vector-scalar instructions.

  6. Instructions I think are not necessary:
    bnall. If all 16 low bits are not set, then all bits are not set, because only the 16 low bits can be set by a comparison. Use bfalse instead.
    sext8, sext16: I think load_s8 and load_s16 are enough.

    In my opinion, there should be a clear separation between vector and scalar instructions. Vector instructions should use names like vshr instead of shr.
    add_i and sub_i can be just add, since there are no add_u and sub_u.

  7. bnall is not the same as bfalse (the name is a bit confusing). The check is ((value & 0xffff) != 0xffff). So, for example, the value 1 would cause a branch for bnall. The intention was the SPMD-ish case; I added it for parity with ball. That said, it's probably not that useful an instruction.

    The reason there isn't a separation between vector and scalar instruction mnemonics is that instructions can mix operand types. You can have the first operand be a vector and the second a scalar, in which case the second is duplicated across all lanes. This avoids needing "broadcast" operations to copy scalar values into vector registers, and is used for handling variables that are constant across all instances in a shader: for example, loop variables, constants, and uniforms.

    add_i and sub_i have the suffixes to differentiate them from the floating point versions add_f and sub_f.

  8. Are clz and ctz executed in 1 stage? In Intel CPUs these instructions have 3 cycles of latency. I think they may be on the critical path. Also, I think that for vector instructions they may be too costly in terms of area. You could use them only for scalar operations.

    Replies
    1. EDIT: In ARM CPUs this instruction has 1 cycle latency (strange). I also found this: http://stackoverflow.com/questions/2368680/count-leading-zero-in-single-cycle-datapath

    2. I wouldn't expect CLZ/CTZ to be significantly worse than the barrel shifter or multiplier (bear in mind the floating point pipeline has both a leading zero counter and a shifter, and each is in a single stage). TimeQuest doesn't report these as critical path.

      I tried something similar to the binary search approach described in the stack overflow post, but it had much more delay because of the chained multiplexers.
