Showing posts with label performance. Show all posts

Saturday, February 28, 2015

Automatic Cache Prefetching


Many optimizations take the form of "lazy evaluation."  When an operation may end up not being needed, we defer it as long as possible for the chance of avoiding it entirely. However, there is another class of optimizations that attempts to do work speculatively, before we know that we need it. At first blush, it seems like there are a few conditions where it might be advantageous to do this:

 1. If the operation has a lot of latency.
 2. If the resource that is needed for the operation tends to be underutilized.
 3. If the operation ends up being needed most of the time anyway.

I've experimented with automatically prefetching cache lines to improve performance in my GPGPU design. This is a common technique in many computer architectures: when a cache miss occurs, the cache will automatically read a few of the following cache lines as well.
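To make the idea concrete, here is a minimal sketch of next-line prefetching in a direct-mapped cache model. The cache geometry and prefetch depth are assumptions for illustration, not the actual design's parameters:

```python
# Illustrative sketch (not the author's RTL): a direct-mapped cache model
# comparing miss counts with and without next-line prefetching.

CACHE_LINES = 64          # number of lines in the cache (assumed)
LINE_SIZE = 64            # bytes per line (assumed)

def simulate(addresses, prefetch_depth=0):
    """Count misses; on a miss, optionally fetch the next few lines too."""
    tags = [None] * CACHE_LINES
    misses = 0
    for addr in addresses:
        line = addr // LINE_SIZE
        index = line % CACHE_LINES
        if tags[index] != line:
            misses += 1
            tags[index] = line
            # Automatic prefetch: pull in the following cache lines as well
            # (prefetches do not count as misses; they happen in the background).
            for i in range(1, prefetch_depth + 1):
                next_line = line + i
                tags[next_line % CACHE_LINES] = next_line
    return misses

# A sequential scan benefits heavily from next-line prefetch.
scan = list(range(0, 16384, 4))
print(simulate(scan), simulate(scan, prefetch_depth=2))   # → 256 86
```

On a sequential scan, prefetching two following lines cuts the miss count by roughly a factor of three; a random access pattern would see little benefit, which is why condition 3 above matters.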

Friday, July 4, 2014

Messy Details

A problem with books and papers about computer architecture is that they gloss over messy details. The block diagram looks so simple, but when you begin to implement it, you realize there are fundamental structural hazards, and they give no hints about how to handle them. Often, a subtle detail completely alters the design. I'm going to talk about a lot of details around the cache hierarchy, specifically the way the L1 and L2 caches communicate with each other, and a design challenge with the current microarchitecture I am implementing.

Friday, June 6, 2014

Faster than light

In the last post, I discussed the fastest possible execution time of a test program, the "speed of light," as it were. There was an important assumption in this calculation: that the CPU issued one instruction per cycle. A common technique to improve performance in modern processors is issuing multiple instructions in the same cycle, a technique known as superscalar execution. I hadn't put much thought into a superscalar design, given the focus on utilizing the wide vector unit. However, some helpful comments on a previous post have led me to reconsider this decision.

Tuesday, May 27, 2014

Keeping score

I recently embarked on a complete redesign of the microarchitecture for the GPGPU I've been working on, with a major goal being to increase the clock frequency. The previous version had a maximum frequency of around 30 MHz when synthesized for my Cyclone IV FPGA board, constrained by some long combinatorial logic paths. One way to increase clock speed is to break stages apart, making the pipeline deeper. This, however, is not without tradeoffs: it increases the latency of each operation (in terms of clock cycles) and introduces new pipeline hazards. Combined, these can decrease the performance of the CPU. I've attempted to mitigate this by utilizing a technique used by a number of modern GPUs (which was in turn borrowed from early out-of-order microprocessors, although in this context it is used for in-order issue).
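The title hints at the technique: presumably a scoreboard, which records which registers have results still in flight so that issue stalls only when an instruction actually needs one of them. A minimal sketch of the idea, with illustrative register names and latency rather than the real pipeline's:

```python
# Minimal in-order-issue scoreboard sketch (illustrative, not the real RTL).
# ready_at maps each register to the cycle its pending result arrives;
# an instruction stalls until all of its source registers are ready.

def run(program, latency):
    """program: list of (dest, src1, src2); returns total cycles to issue all."""
    ready_at = {}
    cycle = 0
    for dest, src1, src2 in program:
        # Stall while any source register has a result still in flight.
        cycle = max(cycle, ready_at.get(src1, 0), ready_at.get(src2, 0))
        ready_at[dest] = cycle + latency   # dest is busy until its result lands
        cycle += 1                         # one instruction issued per cycle
    return cycle

# Back-to-back dependent ops expose the deeper pipeline's latency...
dependent = [("r1", "r0", "r0"), ("r2", "r1", "r1"), ("r3", "r2", "r2")]
# ...while independent ops (e.g. from other hardware threads) hide it.
independent = [("r1", "r0", "r0"), ("r2", "r0", "r0"), ("r3", "r0", "r0")]
print(run(dependent, latency=3), run(independent, latency=3))   # → 7 3
```

This shows why the tradeoff can pay off: the deeper pipeline only hurts when instructions are dependent, and a design with many hardware threads has plenty of independent work to interleave.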

Sunday, December 8, 2013

Memory Access Visualization

It is difficult for the human mind to comprehend large differences in magnitude, and we often forget what large magnitudes exist within computers.  Even creating graphics to effectively display these differences can be challenging.  I've generated some visualizations of memory access patterns of a test program running on my GPGPU design, which renders a cube with bilinear filtered textures applied to the sides.  There are four hardware threads, each of which renders a 64x64 tile at a time.  The program requires around 1.6 million instructions to complete, performs around 500k memory accesses, and uses 2.25 MB of working RAM.
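One simple way to produce such a visualization (the trace format and dimensions here are assumptions for illustration, not the simulator's actual output) is to bin each access by time and address into a grid and emit a grayscale image:

```python
# Sketch: render a memory access trace as a time-vs-address heatmap.
# Writes a plain-text PGM so no imaging library is needed.

WIDTH, HEIGHT = 256, 256
MEM_SIZE = 2304 * 1024     # 2.25 MB of working RAM, per the post

def render(trace, path):
    """trace: list of (time, address) pairs; brighter pixel = more accesses."""
    grid = [[0] * WIDTH for _ in range(HEIGHT)]
    last_time = max(t for t, _ in trace)
    for t, addr in trace:
        x = min(t * WIDTH // (last_time + 1), WIDTH - 1)   # time axis
        y = min(addr * HEIGHT // MEM_SIZE, HEIGHT - 1)     # address axis
        grid[y][x] += 1
    peak = max(max(row) for row in grid) or 1
    with open(path, "w") as f:
        f.write(f"P2 {WIDTH} {HEIGHT} 255\n")
        for row in grid:
            f.write(" ".join(str(255 * v // peak) for v in row) + "\n")

# e.g. a synthetic trace of a linear scan through memory:
render([(t, (t * 64) % MEM_SIZE) for t in range(10000)], "accesses.pgm")
```

A linear scan shows up as a diagonal stripe; stack traffic, texture fetches, and framebuffer writes each leave their own distinctive band.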

Sunday, December 1, 2013

Evaluating Instruction Set Tradeoffs Quantitatively

While a great deal of research is devoted to CPU performance at the microarchitectural level, I haven't found much rigorous analysis of performance tradeoffs inherent to the design of instruction sets themselves. And while it's instructive to examine existing processor instruction sets, without knowing what alternative options were considered, what the tradeoffs were, and ultimately why decisions were made, a lot of important information is missing. For example, one key insight in early research into RISC performance came from analysis of instruction frequencies in common CISC programs.  It was found that the more complex instructions were used substantially less frequently than simpler ones.

I've decided to explore some of the decisions I made in my own processor design (https://github.com/jbush001/GPGPU) empirically.  This processor is targeted by a C++ compiler based on the LLVM toolchain. The beauty of LLVM is that backend code generation is heavily data-driven, based on templated patterns specified in configuration files processed by a tool called "TableGen." This makes it relatively easy to quickly make substantial changes to code generation.  My experimental methodology is to disable usage of certain instruction set features, and measure the resulting performance of a compiled program.
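As a sketch of the frequency analysis described above (the mnemonics below are made up for illustration; a real measurement would parse the toolchain's actual disassembler output):

```python
# Sketch: static instruction-frequency analysis over a disassembly listing.
# Assumes one instruction per line with the mnemonic as the first field.
from collections import Counter

def instruction_frequencies(disassembly):
    """Return {mnemonic: fraction of all instructions}, most frequent first."""
    counts = Counter()
    for line in disassembly.splitlines():
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    total = sum(counts.values())
    return {op: n / total for op, n in counts.most_common()}

# Hypothetical listing for illustration:
listing = """\
load_v v0, (s1)
add_i s1, s1, 4
load_v v1, (s1)
add_i s1, s1, 4
mul_f v2, v0, v1
branch_nz s2, loop
"""
print(instruction_frequencies(listing))
```

A dynamic count (weighting each instruction by how often it executes in a simulator trace) tells an even more useful story than this static count, since loop bodies dominate.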

Friday, November 30, 2012

Waiting in line...

I updated the simple 3d engine running on the GPGPU I've been hacking on to use a more sophisticated runtime. The original incarnation of the engine used a single strand to perform all of the work serially. That won't scale to multiple cores, and it performs suboptimally because utilizing hardware multi-threading is essential to getting good performance on this architecture.

Sunday, November 11, 2012

Branch Prediction

I've been designing a simple GPGPU in my spare time. I recently implemented branch prediction, but when I ran the small suite of benchmarks I had written for it, I found that it only improved performance by a few percent. This may seem a bit puzzling at first blush because the benefits of branch prediction are well known. Was there a bug in my implementation? As it turns out, the answer is no, but the reason why is interesting.
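For context, a common baseline predictor is a table of two-bit saturating counters indexed by branch address. A sketch of that scheme (the table size and trace are illustrative, not the actual implementation):

```python
# Sketch of a classic two-bit saturating-counter branch predictor.
# Counter values 0-1 predict not-taken, 2-3 predict taken; each outcome
# nudges the counter one step toward the observed direction.

TABLE_SIZE = 256   # illustrative

def predict_run(branches):
    """branches: list of (pc, taken) pairs; returns the misprediction count."""
    counters = [1] * TABLE_SIZE        # initialize weakly not-taken
    mispredictions = 0
    for pc, taken in branches:
        entry = pc % TABLE_SIZE
        predicted_taken = counters[entry] >= 2
        if predicted_taken != taken:
            mispredictions += 1
        # Saturating update toward the actual outcome.
        if taken:
            counters[entry] = min(3, counters[entry] + 1)
        else:
            counters[entry] = max(0, counters[entry] - 1)
    return mispredictions

# A loop branch taken 99 times and then falling through mispredicts only
# twice: once while training, and once at loop exit.
trace = [(0x40, True)] * 99 + [(0x40, False)]
print(predict_run(trace))   # → 2
```

With loop-heavy workloads like these benchmarks, prediction accuracy is already high, so the question becomes how much each remaining misprediction actually costs.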