Gofast: performance benchmark

Gofast is a binary protocol implemented in Go for internet applications. It offers low latency and high throughput on a single socket connection, with support for batching, keep-alive heartbeats, pipelining of requests on the same connection, and more. Check out its GitHub source for details.

The following benchmarks were run on a quad-core (8-thread) MacBook Pro with 16 GB of RAM.

The program under perf/ can be used to benchmark POST, REQUEST and STREAM messages using gofast. On the client side there is no logic other than filling up a random payload and sending it to the remote. Likewise, the server code is kept to a bare minimum.
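
As a rough illustration (this is not the actual perf source, and the names here are made up), the per-message work on the client amounts to little more than filling a buffer with random bytes:

package main

import (
	"fmt"
	"math/rand"
)

// makePayload fills a buffer of the requested size with random bytes,
// which is essentially all the perf client does before sending a message.
func makePayload(size int) []byte {
	payload := make([]byte, size)
	rand.Read(payload) // the content is irrelevant, only the size matters
	return payload
}

func main() {
	fmt.Printf("generated a %d byte payload\n", len(makePayload(512)))
}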

$ go get github.com/bnclabs/gofast
$ cd $GOPATH/src/github.com/bnclabs/gofast/perf
$ go build

This builds the perf program, which can be started either as a server or as a client.

Start the server

To start the perf program as a server, use the -s switch and supply the port address via the -addr argument.

$ ./perf -s -addr :9900

POST Benchmark

To start the perf program as a client, use the -c switch and supply the server address via the -addr argument. Do a sufficiently large number of POSTs so that the entire operation can be timed; in this case we will POST 1 million messages to the server.

time ./perf -c -addr localhost:9900 -do post -count 1000000

Latency Average: 11.474µs
Throughput: 83334 /second

stats { 
  "n_dropped":0,
  "n_flushes":1000015,
  "n_mdrops":0,
  "n_rx":15,
  "n_rxbeats":12,
  "n_rxbyte":493,
  "n_rxfin":0,
  "n_rxpost":12,
  "n_rxreq":1,
  "n_rxresp":2,
  "n_rxstart":0,
  "n_rxstream":0,
  "n_tx":1000015,
  "n_txbyte":548000517,
  "n_txfin":0,
  "n_txpost":1000012,
  "n_txreq":2,
  "n_txresp":1,
  "n_txstart":0,
  "n_txstream":0
}
request stats: n:1000000 mean:11.474µs var:2m25.205163142s sd:381.057µs
./perf -c -addr localhost:9900 -do post -count 1000000  6.53s user 18.04s system 200% cpu 12.229 total

The client program completes in about 12.2 seconds of wall-clock time and dumps the final statistics for the transport. For the rest of the experiments we will ignore the raw stats map and focus only on latency and throughput. In the above run we used the default values: 1 connection, a batchsize of 1, and 1 routine.
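
Before turning those knobs, here is a generic sketch of what batching means at the socket level (this is not gofast's actual framing or flush logic): instead of flushing the connection after every message, the writer accumulates a batch and flushes once, trading a little per-message latency for far fewer writes on the wire.

package main

import (
	"bufio"
	"fmt"
	"os"
)

// flushEvery writes msgs to w, flushing once per batchsize messages instead
// of once per message. Fewer flushes means fewer writes on the underlying
// socket, which is the general idea behind the -batchsize knob.
func flushEvery(w *bufio.Writer, msgs [][]byte, batchsize int) error {
	for i, msg := range msgs {
		if _, err := w.Write(msg); err != nil {
			return err
		}
		if (i+1)%batchsize == 0 {
			if err := w.Flush(); err != nil {
				return err
			}
		}
	}
	return w.Flush() // flush any trailing partial batch
}

func main() {
	msgs := [][]byte{[]byte("a\n"), []byte("b\n"), []byte("c\n")}
	w := bufio.NewWriter(os.Stdout) // stands in for a net.Conn
	if err := flushEvery(w, msgs, 2); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}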

Let us increase concurrency to 100 routines and batchsize to 100 messages:

time ./perf -c -addr localhost:9900 -do post -batchsize 100 -routines 100 -count 1000000

Latency Average: 408.822µs
Throughput: 250001 /second

Throughput increases, but latency also increases. Are they related? Let us decrease the batchsize back to 1.

time ./perf -c -addr localhost:9900 -do post -batchsize 1 -routines 100 -count 1000000

Latency Average: 702.487µs
Throughput: 142858 /second
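
They are related through the number of messages in flight. As a rough sanity check using Little's law (in-flight ≈ throughput × latency):

  83334/sec  × 11.474µs  ≈ 1 in flight    (1 routine, batchsize 1)
  250001/sec × 408.822µs ≈ 102 in flight  (100 routines, batchsize 100)
  142858/sec × 702.487µs ≈ 100 in flight  (100 routines, batchsize 1)

Per-message latency rises because each message now waits behind the other in-flight messages, while aggregate throughput rises because roughly 100 messages are being serviced concurrently.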

Batching shows a definite improvement when combined with concurrent routines. Let us increase the number of connections.

time ./perf -c -addr localhost:9900 -do post -conns 8 -batchsize 100 -routines 100 -count 1000000

Latency Average: 1.029988ms
Throughput: 727280 /second

Now we have a throughput of 727K/sec POST operations using 8 connections. Although latency is fairly high at about 1ms, on the local loop we could attain close to three-quarters of a million POSTs per second using 100 concurrent routines. Concurrency can definitely help mitigate the high-latency/low-throughput situation, provided the server logic can be as concurrent as its clients. And with this experiment, all 8 threads on my MacBook Pro are fully saturated.
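
For reference, the -routines knob on the client side amounts to a fan-out of goroutines, each keeping its own messages in flight. A minimal generic sketch (not the perf program's actual code) looks like this:

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// fanOut runs `routines` goroutines, each issuing its share of `count`
// calls to send() (any remainder is ignored for brevity). More routines
// means more messages in flight at once, which is what lets throughput
// climb even when per-message latency grows.
func fanOut(count, routines int, send func()) {
	var wg sync.WaitGroup
	per := count / routines
	for r := 0; r < routines; r++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < per; i++ {
				send()
			}
		}()
	}
	wg.Wait()
}

func main() {
	var total int64
	fanOut(1000000, 100, func() {
		atomic.AddInt64(&total, 1) // stand-in for posting one message
	})
	fmt.Println("sent", total, "messages")
}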

One last experiment with POST: let us decrease the payload from 512 bytes (the default) to 64 bytes.

time ./perf -c -addr localhost:9900 -do post -conns 8 -batchsize 100 -routines 100 -payload 64 -count 1000000

Latency Average: 819.28µs
Throughput: 888896 /second

Decreasing the payload improves latency by about 20%, and the gain is reflected in throughput as well.

REQUEST Benchmark

We have now exercised most of the options: number of connections, payload size, batchsize and number of routines. For the remaining experiments we shall stick to a 64-byte payload, 100 routines and a batchsize of 100.

For a single connection, let us do 1 million request-responses.

time ./perf -c -addr localhost:9900 -do request -conns 1 -batchsize 100 -routines 100 -payload 64 -count 1000000

Latency Average: 1.430403ms
Throughput: 71435

For a similar configuration, POST can do about 333K operations/sec. This is because REQUEST involves a round trip of the 64-byte payload, from client to server and back from server to client, which leads to about 1.4ms latency and hence a throughput of only 71K/sec.
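
The same back-of-the-envelope check applies here: 71435/sec × 1.430403ms ≈ 102 requests in flight, which is roughly the 100 routines we configured. With concurrency capped, throughput can only improve by keeping more requests in flight or by cutting the round-trip latency.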

To decrease the latency, reduce the batchsize and routines to 1:

time ./perf -c -addr localhost:9900 -do request -conns 1 -batchsize 1 -routines 1 -payload 64 -count 1000000

Latency Average: 74.238µs
Throughput: 13513

We could bring the latency down to 74µs, although throughput suffers. Let us tune it further for better throughput: increase the batchsize to 8 and the concurrency to 20:

time ./perf -c -addr localhost:9900 -do request -conns 1 -batchsize 8 -routines 20 -payload 64 -count 1000000

Latency Average: 354.416µs
Throughput: 21739

By increasing the number of connections to 8, with batchsize and concurrency at 100:

time ./perf -c -addr localhost:9900 -do request -conns 8 -batchsize 100 -routines 100 -payload 64 -count 1000000

Latency Average: 6.116246ms
Throughput: 129045

We get good throughput but poor latency.
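
A rough check suggests why: 129045/sec × 6.116246ms ≈ 789 requests in flight, consistent with 8 connections each carrying close to 100 concurrent requests; queueing behind that many in-flight requests is what pushes the average latency up to around 6ms.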

STREAM Benchmark

The client can initiate a stream request using -do streamtx and send -stream number of messages to the remote. To initiate a stream request in the reverse direction, use -do streamrx.

time ./perf -c -addr localhost:9900 -do streamtx -batchsize 1 -routines 1 -payload 64 -conns 1 -count 1 -stream 100000

Latency Average: 1.135048775s
Throughput: 100001

This run took 1.9 seconds to complete a single stream request and to stream 100K messages on that request; the reported latency here appears to be the time to complete an entire stream request, which is why it is measured in seconds. What happens if we increase the number of stream requests to 2?

time ./perf -c -addr localhost:9900 -do streamtx -batchsize 1 -routines 1 -payload 64 -conns 1 -count 2 -stream 100000

Latency Average: 1.076442179s
Throughput: 100001

Let us increase the routines to 2:

time ./perf -c -addr localhost:9900 -do streamtx -batchsize 1 -routines 2 -payload 64 -conns 1 -count 2 -stream 100000

Latency Average: 2.29462642s
Throughput: 100001

Latency has jumped; this means there is congestion on the socket due to the concurrent routines. Now let us optimize by increasing the batchsize, along with count and routines:

time ./perf -c -addr localhost:9900 -do streamtx -batchsize 20 -routines 20 -payload 64 -conns 1 -count 20 -stream 100000

Latency Average: 8.180717349s
Throughput: 250002

We have managed to stream messages at a 250K/sec rate, but due to the increased congestion on the socket, latency suffers. Now let us make a lot of stream requests, each sending just 10 messages, and increase the number of stream requests to 100K:

time ./perf -c -addr localhost:9900 -do streamtx -batchsize 100 -routines 100 -payload 64 -conns 1 -count 100000  -stream 10

Latency Average: 9.627805ms
Throughput: 91666

91K/sec throughput at about 9.6ms latency.
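
As a rough closing check: 91666 messages/sec over 10-message stream requests is about 9,167 stream requests completed per second, and at ~9.6ms per request that corresponds to roughly 88 stream requests in flight, close to the 100 routines configured.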