Introduction

Toward the end of last year I became interested in FPGAs, in part because it is now possible to target them with an open tool-chain. I acquired some hardware and attempted to build a custom system-on-chip with LiteX, but (despite being a competent Python programmer) I struggled to make good progress because I didn’t properly understand digital design.

Conscious incompetence

With my confidence re-calibrated and more spare time indoors than expected (due to the COVID-19 lock-down) I decided to go back to basics and learn to build simple digital circuits with the Verilog hardware description language. I found a few tutorials on the web, but didn’t make real progress until I signed up on HDLBits and started working through the exercises. One of the most interesting HDLBits problems to implement Conway’s Game Of Life. Specifically it calls for:

  • A 16x16 grid that wraps around at the edges to provide an infinite (toroidal) surface.
  • Parallel update (specifically: “game state should advance by one time-step every clock cycle”).

Unlike most of the problems before it, I was unable to solve this one by just iterating on the Verilog using the web interface provided by HDLbits. While trying to simulate and debug my implementation with Icarus Verilog’s iverilog on my laptop, I quickly grew frustrated with the limitations of the behavioural part of Verilog as a language to build test-benches. Finally I discovered Verilator: a tool that generates C++ code from Verilog modules.

This is a blog post about my experience with Verilator, and other things…

John Conway

In the late 1960s, the mathematician John Conway came up with a cellular automata that he named the Game Of Life. It was first published in a column in Scientific American in 1970. Also known as Life, it has the interesting property of being Turing complete, meaning that (mathematically speaking) anything that can be computed algorithmically, can be computed within the game.

In Life, the world is comprised of a grid of cells, where each cell can be either alive or dead. Life proceeds in time steps, where the fate of a cell is determined by a set of rules that depends on the previous state of each cell and it’s eight neighbours:

  • A live cell with two or three live neighbours remains alive (the happy middle ground),
  • a dead cell with exactly three live neighbours is reborn (a three-parent-baby) and
  • for any other number of live neighbours, the cell dies, or stays dead.

Notice the overlap between the first and second rule for three live neighbours (the output is the same regardless of the cell’s previous state).

Putting this all together with a specific initial arrangements of cell state can result in (a range of) interesting patterns, some of which are stable. One of this simplest is known as a Glider:

Animation of a Glider

An animation of a Glider in Conway’s Game Of Life [Rodrigo Silveira Camargo]

In this inaugural blog post, I’ll use Conway’s Game Of Life to explore some technical minutia, as an homage to John Conway, who died in April from complications related to COVID-19, and to the far too many other elders we’ve lost recently.

Modelling digital logic

I learned that there are two flavours of digital logic:

  • Combinatorial logic is stateless in the sense that the output is (instantaneously) logical function of the input. Think of it as digital electronic gates connected by wires.
  • Sequential logic is stateful, meaning it has memory. The output can depend on the input and/or some previous (stored) condition.

I used combinatorial logic to wire up the same set rules for all 256 cells, such that they are updated at every clock cycle. Here is rule.v which models the logic for updating a single cell:

`default_nettype none

module rule (
    input [7:0] neigh,
    input current,
    output next );

    wire [2:0] pop;
    assign pop = {2'b00, neigh[0]} +
                 {2'b00, neigh[1]} +
                 {2'b00, neigh[2]} +
                 {2'b00, neigh[3]} +
                 {2'b00, neigh[4]} +
                 {2'b00, neigh[5]} +
                 {2'b00, neigh[6]} +
                 {2'b00, neigh[7]}; 
    
    reg tmp;
    assign next = tmp;

    always @(*) begin
        case (pop)
            2: tmp = current;
            3: tmp = 1;
            default: tmp = 0;
        endcase
    end
endmodule

This module takes the state of the cell (current) and the eight neighbours (neigh), does a lookup to determine the population count of neigh, and then maps that to the next state with the case structure.

Now to wire this into the main module defined in life.v:

`default_nettype none
`include "rule.v"

module life (
    input clk,
    input load,
    input [255:0] data,
    output reg [255:0] q ); 

    wire [255:0] next;
    
    genvar x, y;
    generate
        for (x=0; x<=15; x=x+1) begin : gen_x
            for (y=0; y<=15; y=y+1) begin : gen_y
                rule fate (
                    .neigh({q[(x==0 ? 15 : x-1) + (y==0 ? 15 : y-1)*16],
                            q[(x==0 ? 15 : x-1) + y                *16],
                            q[(x==0 ? 15 : x-1) + (y==15 ? 0 : y+1)*16],
                            q[x                 + (y==0 ? 15 : y-1)*16],
                            q[x                 + (y==15 ? 0 : y+1)*16],
                            q[(x==15 ? 0 : x+1) + (y==0 ? 15 : y-1)*16],
                            q[(x==15 ? 0 : x+1) + y                *16],
                            q[(x==15 ? 0 : x+1) + (y==15 ? 0 : y+1)*16]}),
                    .current(q[(x + y*16)]),
                    .next(next[(x + y*16)])
                );
            end
        end
    endgenerate
    
    always @(posedge clk) begin
        if (load) begin
            q <= data;
        end else begin
            q <= next;
        end
    end
endmodule

The ugliest part of this module is the combination of ternary operators (to handle wrap-around for the edge of the grid) with the arithmetic to calculate the offset of cells (in the 256 bit vector) based on their relative position (e.g., top-left) in x, y coordinates. I’m not super happy with my Verilog code. I’d like to eliminate the multiplication, and I’m probably using more registers than I should.

This is not a computer program

Don’t be fooled by the for loops. Normally, compiling Verilog (this step is called synthesis) produces a digital circuit. The generate block is executing a template that outputs parameterised copies of the statements inside the loop (in this case, one instantiation of the rule module). These Verilog control flow statements don’t output control logic.

To realise Life as a digital circuit, we would have 256 (16x16) groups of gates. I’m not sure exactly how many gates per group, but probably a few dozen at least.

Simulating the simulation

A simulator turns the Verilog description of hardware into a program that can execute on a conventional computer. It is much easier to test and debug software than complex digital circuits. A good simulator will let you generate stimulus (a sequence of test inputs) and then step through the sequence to observe the circuit’s response (be that by logging text messages, or reading a timing diagram).

My intuition is that (other than for testing by simulation) one wouldn’t want to implement an algorithm (say, a game like Life) to run on a general purpose computer this way when you could just use a conventional compiler for a general purpose programming language.

Why?, well it is certainly compilcated and surely it would be inefficient (i.e., slow)… right?

Enter the Verilator

Looping back to the introduction, Verilog generates C++ classes that implement the logic defined in Verilog modules. One can then compile those classes into a program that implements a simulation of the digital circuit. Thanks to the fantastic tutorials on the ZipCPU blog, I quickly got a simulation running and to my surprise, it was fast!

How fast?

Well, I didn’t have a baseline so I implemented a quick Python version of Life. Second surprise, the Python version is about ten time slower. Ok, I must have done something wrong. I re-implemented the Python code to use a single (big) integer for the grid as opposed to a 2D array (list-of-lists). No improvement.

At this point I thought: assuming I’m not a terrible Python programmer, then maybe Python is slower than I expected.

Next, I thought I should implement Life in C++ to get a sense of how slowly the Verilator-generated code runs by comparison. My first version used C++’s std::bitset for the grid. Another surprise: this version was still about 50% slower. Throwing out std::bitset and employing mostly C-style code had the pleasing effect of reducing the line count by 30-40% and speeding up the implementation to be more-or-less on par with Verilator’s generated code.

The generated Vlife.cpp file is 8,180 lines of code, while the equivalent hand-written C++ is less than 100 lines.

Profiling the C++ versions

I profiled execution for 10,000 generations of Life with both versions using callgrind.

Here is the Verilator version:

Events    : Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw
Collected : 333894020 67017216 32547995 11754767 14599 2468 4211 8079 1631

I   refs:      333,894,020
I1  misses:     11,754,767
LLi misses:          4,211
I1  miss rate:        3.52%
LLi miss rate:        0.00%

D   refs:       99,565,211  (67,017,216 rd + 32,547,995 wr)
D1  misses:         17,067  (    14,599 rd +      2,468 wr)
LLd misses:          9,710  (     8,079 rd +      1,631 wr)
D1  miss rate:         0.0% (       0.0%   +        0.0%  )
LLd miss rate:         0.0% (       0.0%   +        0.0%  )

LL refs:        11,771,834  (11,769,366 rd +      2,468 wr)
LL misses:          13,921  (    12,290 rd +      1,631 wr)
LL miss rate:          0.0% (       0.0%   +        0.0%  )

Here is my hand-coded version:

Events    : Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw
Collected : 384980712 66785393 29764871 1903 14565 2452 1801 8075 1617

I   refs:      384,980,712
I1  misses:          1,903
LLi misses:          1,801
I1  miss rate:        0.00%
LLi miss rate:        0.00%

D   refs:       96,550,264  (66,785,393 rd + 29,764,871 wr)
D1  misses:         17,017  (    14,565 rd +      2,452 wr)
LLd misses:          9,692  (     8,075 rd +      1,617 wr)
D1  miss rate:         0.0% (       0.0%   +        0.0%  )
LLd miss rate:         0.0% (       0.0%   +        0.0%  )

LL refs:            18,920  (    16,468 rd +      2,452 wr)
LL misses:          11,493  (     9,876 rd +      1,617 wr)
LL miss rate:          0.0% (       0.0%   +        0.0%  )

As you can see, they’re pretty close. Verilator’s C++ runs fewer instructions than my hand-coded C++, but somehow results in many more instruction cache misses. Both take about 0.6 seconds to run on my laptop.

How does Verilator produce such fast code?

I think my implementation of Life is particularly well suited to the optimisation tricks it uses to map combinatorial logic to efficient C++ code (albeit thousands of lines) which the compiler is in turn good at optimising to produce relatively nimble machine code. Note also that with sufficiently large designs (i.e, not life.v) Verilator can take advantage of multithreading to speed up simulations.

Okay then, mind blown. Verilator is Verigood.

Das blinkenlights

Thank you for making it this far! This is how the console output looks with pauses between the generations (this is the Verilator version, but the output is identical):

Verilog+Verilator+C++ version of Conway's Game Of Life

Console output from the Verilator-generated C++ code for Conway’s Game Of Life

The C++ code

For those keen to scrutinise my C++ implementation (or try running Life), here is the full listing of life.cpp:

#include <climits>
#include <cstdlib>
#include <unistd.h>

#include <chrono>
#include <iostream>
#include <string>
#include <thread>

#include "Vlife.h"
#include "verilated.h"

#define WIDTH (16UL)
#define FREERUNNING (-1)

#define GETCELL(b, x, y) ((b)[y] >> (WIDTH - (x) - 1UL) & 1UL)
#define SETCELL(b, x, y) ((b)[y] |= 1UL << (WIDTH - (x) - 1UL))
#define CLEARCELL(b, x, y) ((b)[y] &= ~(1UL << (WIDTH - (x) - 1UL)))
#define COPYCELL(b, x, y, v)        \
    if (v) {                        \
        SETCELL((b), (x), (y));     \
    } else {                        \
        CLEARCELL((b), (x), (y));   \
    }

bool codeMode;
bool sleepMode;
int iterations;

void bailOut(char const *msg) {
    std::cerr << msg << std::endl;
    exit(1);
}

void parseArgs(int argc, char *argv[]) {
    unsigned int iter;
    codeMode = false;
    sleepMode = true;
    iterations = FREERUNNING;
    for (;;) {
        switch (getopt(argc, argv, "nci:")) {
        case 'n':
            sleepMode = false;
            continue;
        case 'c':
            codeMode = true;
            ;
            continue;
        case 'i':
            char *end;
            iter = strtol(optarg, &end, 10);
            if (end != optarg || iter < 0) {
                iterations = iter;
            } else {
                bailOut("Invalid value for -i");
            }
            continue;
        case -1:
            break;
        case 'h':
        default:
            bailOut("Usage:\n"
                    "\t-n       Don't sleep between generations (default "
                    "sleeps for 100ms).\n"
                    "\t-c       Run C implementation (default is HDL "
                    "implementation).\n"
                    "\t-i <num> Run for <num> iterations (default is an "
                    "infinite loop).\n"
                    "\t-h       Show this message.\n");
        }
        break;
    }
}

void printGrid(const uint16_t *b) {
    unsigned char buf[WIDTH + 1];
    for (unsigned int y = 0; y < WIDTH; y++) {
        for (unsigned int x = 0; x < WIDTH; x++) {
            buf[x] = GETCELL(b, x, y) + '0'; // digit to ascii
        }
        buf[WIDTH] = 0;
        std::cout << buf << std::endl;
    }
}

void generation(const uint16_t *in, uint16_t *out) {
    for (unsigned int y = 0; y < WIDTH; y++) {
        for (unsigned int x = 0; x < WIDTH; x++) {
            unsigned int sum = 0;

            unsigned int left = (x == 0) ? WIDTH - 1 : x - 1;
            unsigned int right = (x == WIDTH - 1) ? 0 : x + 1;
            unsigned int above = (y == 0) ? WIDTH - 1 : y - 1;
            unsigned int below = (y == WIDTH - 1) ? 0 : y + 1;

            if (GETCELL(in, left, y)) sum++;
            if (GETCELL(in, right, y)) sum++;
            if (GETCELL(in, x, above)) sum++;
            if (GETCELL(in, x, below)) sum++;
            if (GETCELL(in, left, above)) sum++;
            if (GETCELL(in, left, below)) sum++;
            if (GETCELL(in, right, above)) sum++;
            if (GETCELL(in, right, below)) sum++;

            switch (sum) {
            case 2:
                COPYCELL(out, x, y, GETCELL(in, x, y));
                break;
            case 3:
                SETCELL(out, x, y);
                break;
            default:
                CLEARCELL(out, x, y);
            }
        }
    }
}

int main(int argc, char **argv) {
    parseArgs(argc, argv);

    uint16_t a[WIDTH] = {
        0b0000000000000000,
        0b0000000000000000,
        0b0000100000000000,
        0b0000010000000000,
        0b0001110000000000,
        0b0000000000000000,
        0b0000000000000000,
        0b0000000000000000,
        0b0000000000000000,
        0b0000000000000000,
        0b0000000000000000,
        0b0000000000000000,
        0b0000000000000000,
        0b0000000000000000,
        0b0000000000000000,
        0b0000000000000000};

    uint16_t b[WIDTH] = {0};

    uint16_t *in, *out, *t;
    in = a;
    out = b;

    Vlife *core;

    if (!codeMode) {
        core = new Vlife;
        memcpy(core->data, in, (WIDTH * WIDTH) / 8);
        core->load = 1;

        // TODO: figure out why load requires >= 2 clock cycles
        core->clk = 1;
        core->eval();
        core->clk = 0;
        core->eval();
        core->clk = 1;
        core->eval();
        core->clk = 0;
        core->eval();

        core->load = 0;
    }

    bool run = true;

    while (run) {
        std::cout << "\033[0m\033[2J\033[H"; // clear screen
        if (codeMode) {
            printGrid(in);
            // compute next state
            generation(in, out);
            // swap buffers
            t = in;
            in = out;
            out = t;
            std::cout << "Code";
        } else {
            printGrid(reinterpret_cast<uint16_t *>(core->q));
            // drive logic
            core->clk = 1;
            core->eval();
            core->clk = 0;
            core->eval();
            std::cout << "HDL";
        }

        if (iterations != FREERUNNING) {
            iterations--;
            run = iterations > 0;
            std::cout << " iter=" << iterations << std::endl;
        }

        if (sleepMode) {
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }
    }
}

Building and running

To build Life, place the three files above in a directory, install a recent version (I think 4.x should be fine) of Verilator and run:

verilator -Wall -cc life.v --exe --build -o life life.cpp

If all goes well, you’ll find the compiled program in ./obj_dir/life

Conclusion

So what have I learned?

  • Python is terrible.
  • For faster C++ code, write C code.
  • For the fastest C++ code, write Verilog.

Jokes aside:

While Verilog it is new to me, many of the tools and languages mentioned in this post are older than I am. Weirdly, the old age of software bodes well for the odds that most of it will outlast us… and that is Life!

John Conway in 2005

John Horton Conway FRS (1937 - 2020) [Thane Plambeck]