I've been designing a simple GPGPU in my spare time. I recently implemented branch prediction, but when I ran the small suite of benchmarks I had written for it, I found that it improved performance by only a few percent. This may seem puzzling at first blush, because the benefits of branch prediction on CPUs are well known. Was there a bug in my implementation? As it turns out, the answer is no, but the reason why is interesting.