Thursday, August 21, 2014

"It works in the simulator..."

I do most hardware development in simulation. It allows a level of visibility that software developers can only dream of: I can see the state of every signal at every instant for the entire run. But there is a dark side to simulation: the synthesized design may not behave the same way as the simulated one. A few well-known patterns lead to synthesis/simulation mismatches. They are easy enough to avoid with attentive coding, but the existence of these gremlins always leaves me uneasy...

I recently began bringing up the next generation of my parallel processor on FPGA.  I'd shaken a lot of bugs out in simulation using randomized instruction streams and directed test programs.  I fixed some nit-picky synthesis errors and loaded it onto the board and... nothing.

In a situation like this, there are a few approaches I use.  First, I routed internal signals to LEDs. The program counter was advancing, so it did appear to be executing code, but that was all I could tell using that debugging method. Second, I had hacked together a simple embedded logic analyzer, which captures internal signals and dumps them out the serial port when triggered; this gave me a more detailed view of what was going on.  Iteration was slow, because it takes over an hour to synthesize the design in my VirtualBox Linux instance and I don't have a lot of time each day to spend on this, but after a few weeks, I finally isolated three major problems and got the design running pretty well.

The first issue had to do with port direction specifiers: through sloppy copy/pasting, I had marked as an output port something that should have been an input. It seems like this should immediately raise an error, but the way the Verilog language is designed makes this less clear cut than one might expect. It is perfectly valid to read output signals within a module. It's also okay to connect multiple outputs together, because some of them may be driving the net with the Z (high impedance) state. At any rate, Verilator was perfectly happy with it and it simulated fine.  However, Quartus, seeing that the output port had no driver within that module, optimized away the logic entirely (to be fair, it did print a warning).
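
To make the first issue concrete, here is a minimal sketch of the kind of mistake involved. The modules and signals (`consumer`, `top`, `enable`, `count`, `reset`) are hypothetical and not taken from the actual design, and how any particular simulator or synthesis tool version reacts to it is an assumption based on the behavior described above.

```systemverilog
// Hypothetical example of a mis-declared port direction.
module consumer(
    input              clk,
    input              reset,
    output             enable,  // BUG: only read here, should be 'input'
    output logic [7:0] count);

    always_ff @(posedge clk)
    begin
        if (reset)
            count <= 8'd0;
        else if (enable)        // Reading an output port is legal Verilog
            count <= count + 8'd1;
    end
endmodule

module top(
    input              clk,
    input              reset,
    input              enable,
    output logic [7:0] count);

    // 'enable' is driven from outside. Nothing inside 'consumer' drives
    // its 'enable' port, so that side contributes only Z and the net
    // resolves to the external value in simulation.
    consumer u_consumer(
        .clk(clk),
        .reset(reset),
        .enable(enable),
        .count(count));
endmodule
```

A synthesis tool that looks only at `consumer` sees an output with no driver, which is consistent with the logic depending on it being optimized away.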

The second issue had to do with the way Altera's built-in SRAM "Megafunction" (ALTSYNCRAM) behaves when the same address is read and written on different ports during the same cycle. This design expects that the read will return the data that is being written on the other port.  Altera's documentation describes a parameter that controls how the memory should behave in this case.  I set the value "NEW_DATA" as suggested.  However, as it turns out, my particular FPGA model doesn't support that value.  Quartus didn't raise a warning; it just ignored the parameter and returned the old data at that address.  Adding bypass logic in my own design remedied the issue.
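
A bypass of this kind can be sketched roughly as follows. This is not the actual fix from the design: the module name `sram_1r1w_bypass`, its ports and parameters, and the use of an inferred memory array standing in for ALTSYNCRAM are all assumptions made for illustration. The idea is to detect an address collision in the cycle the read is issued, latch the data being written, and select it in place of the stale RAM output on the following cycle.

```systemverilog
// Hypothetical one-read/one-write RAM wrapper with write-to-read bypass.
module sram_1r1w_bypass
    #(parameter DATA_WIDTH = 32,
    parameter ADDR_WIDTH = 8)

    (input                          clk,
    input                           read_en,
    input [ADDR_WIDTH - 1:0]        read_addr,
    output logic [DATA_WIDTH - 1:0] read_data,
    input                           write_en,
    input [ADDR_WIDTH - 1:0]        write_addr,
    input [DATA_WIDTH - 1:0]        write_data);

    logic [DATA_WIDTH - 1:0] mem[2 ** ADDR_WIDTH];
    logic [DATA_WIDTH - 1:0] ram_q;
    logic [DATA_WIDTH - 1:0] bypass_data;
    logic bypass_en;

    always_ff @(posedge clk)
    begin
        if (write_en)
            mem[write_addr] <= write_data;

        if (read_en)
        begin
            // The write above is nonblocking, so on a collision this
            // read returns the old contents of the location.
            ram_q <= mem[read_addr];

            // Remember whether this read collided with a write, and
            // capture the value that was being written.
            bypass_en <= write_en && (read_addr == write_addr);
            bypass_data <= write_data;
        end
    end

    // On a collision, forward the new data instead of the stale RAM output.
    assign read_data = bypass_en ? bypass_data : ram_q;
endmodule
```

Updating the bypass registers only when `read_en` is set keeps `read_data` stable between reads, matching how the registered RAM output holds its value.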

But the last one was the nastiest.  I looked at this code many times and didn't spot it.  Can you?
    assert(dd_instruction.has_dest && !dd_instruction.dest_is_vector)
    wb_writeback_value <= sq_store_sync_success;
The problem is that a semicolon is missing from the end of the assert statement.  The language spec says that assert() basically works like an if statement.  If the condition in parentheses is true, the following statement is executed.  It can even have an else statement. I knew this, but I just overlooked the fact that the semicolon was missing, and hadn't appreciated what nasty bugs could be introduced if it was inadvertently omitted.

In simulation, leaving the semicolon off doesn't change program behavior. Since the assertion condition is always true (as it's supposed to be), the nonblocking assignment executes.  However, during synthesis, Quartus treats the assignment as part of the assertion, which is not synthesizable, and removes it entirely.  It doesn't give a warning, as this is expected behavior.
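
For comparison, here is a sketch of one way to write the statement so the assignment cannot be absorbed into the assertion. Only the signal names come from the snippet above. The module wrapper, the flattening of `dd_instruction` into two single-bit inputs, the one-bit width of `wb_writeback_value`, and the error message are hypothetical and chosen only to keep the example self-contained.

```systemverilog
// Hypothetical wrapper around the statements from the post.
module writeback_sketch(
    input        clk,
    input        has_dest,
    input        dest_is_vector,
    input        sq_store_sync_success,
    output logic wb_writeback_value);

    always_ff @(posedge clk)
    begin
        // Buggy form, for reference. With no semicolon after the
        // assert, the assignment becomes the assertion's pass statement:
        //
        //     assert(has_dest && !dest_is_vector)
        //     wb_writeback_value <= sq_store_sync_success;

        // Giving the assertion an explicit 'else' action and a
        // terminating semicolon closes it off...
        assert(has_dest && !dest_is_vector)
        else $error("writeback without a scalar destination");

        // ...so this is an ordinary, independent statement.
        wb_writeback_value <= sq_store_sync_success;
    end
endmodule
```

Simply adding the missing semicolon after the closing parenthesis is enough to fix the bug. The explicit `else` branch is a stylistic choice that makes the end of the assertion harder to overlook.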

In the end, these turned out to be fairly tractable, and I learned some new things to watch out for when I do see issues.

Anyway, here's a single core clocked at 50 MHz rendering a Phong-shaded teapot with 2300 triangles:

That's around 2 frames per second.  It's not setting any speed records yet, but it works! :)


  1. Hey Jeff,
    I am simulating Altera's altsyncram using Verilator. Though I can compile it fine and write as well, I cannot get the output from q. I am using dual-port RAM with 32-bit input words. While simulating, can you tell me how you increment the address, as in steps of 1 or something else?

    1. I assume you're asking how I increment the program counter? Since instructions are fixed width and 4 bytes wide, I increment by four. The logic for this is here:

      This is complicated by the fact that the program counter is incremented speculatively and needs to be rolled back if an instruction cache miss occurs.

      I can't really comment on your issue simulating altsyncram as I haven't seen the code you are using.