Benchmarking Go FFI

Most programming languages offer a way to call C functions and libraries via a mechanism called FFI (Foreign Function Interface). This allows for compatibility with code written in a different programming language, or in some cases makes certain operations faster (this is what NumPy does in Python: it's a Python library that performs computations written in C).

Conflicting information exists online regarding the performance of CGO, the FFI feature in Go. Some sources assert that it’s slow, while others argue that its speed is sufficient and that the overhead is actually negligible.

Let's measure the actual cost by running some benchmarks.

First, a baseline: the cost of a function call

To establish a baseline, we’ll measure the cost of a function call using a simple, pure Go function that performs an integer addition:

func add(x, y uint64) uint64 {
  return x + y
}

Via the following benchmark:

func BenchmarkAddPureGo(b *testing.B) {
	x := uint64(0x12345678)
	y := uint64(0xABCDEF00)
	for i := 0; i < b.N; i++ {
		add(x, y)
	}
}

Let’s run it:

❯ go test . -bench=.
goos: linux
goarch: amd64
pkg: cgo-bench
cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
BenchmarkAddPureGo-8	1000000000			0.2176 ns/op
PASS
ok		cgo-bench	0.245s

The benchmark ran in 0.245 seconds and executed add 1000000000 times, which means that each iteration took 0.2176 ns. Given that my processor runs at 3 GHz, or 0.33 ns per cycle, this number is suspiciously low: it is less than a single CPU cycle per iteration. Maybe the compiler optimized away most of our benchmark?

It’s possible to verify that by looking at the assembly produced for the benchmark:

❯ go test -c . # compile the test binary
❯ go tool objdump -gnu -s BenchmarkAddPureGo cgo-bench.test
TEXT cgo-bench.BenchmarkAddPureGo(SB) /home/aurelien/personnal/cgo-bench/main_test.go
  0x4f8480		31c9			XORL CX, CX                          // xor %ecx,%ecx
  0x4f8482		eb03			JMP 0x4f8487                         // jmp 0x4f8487
  0x4f8484		48ffc1			INCQ CX                              // inc %rcx
  0x4f8487		483988a0010000	CMPQ CX, 0x1a0(AX)                   // cmp %rcx,0x1a0(%rax)
  0x4f848e		7ff4			JG 0x4f8484                          // jg 0x4f8484
  0x4f8490		c3				RET                                  // retq

There's no CALL or ADD instruction here. It seems that the compiler realised it was calling a function that had no side effects and whose result was never used, and thus decided to remove the call completely. What remains is the code of the loop itself, which increments a counter until it reaches a certain value.

This kind of behavior is usually desirable in a compiler because it makes programs faster, but here we don't want it. It can be prevented by storing the result of the add function in a global variable. This makes it harder for the compiler to prove that the result is never used, and should prevent this optimization from happening.

Let’s try with this new benchmark function:

var result uint64

func BenchmarkAddPureGo(b *testing.B) {
	x := uint64(0x12345678)
	y := uint64(0xABCDEF00)
	for i := 0; i < b.N; i++ {
		y = add(x, y)
	}
	result = y
}
❯ go test -c .
❯ go tool objdump -gnu -s BenchmarkAddPureGo cgo-bench.test
TEXT cgo-bench.BenchmarkAddPureGo(SB) /home/aurelien/personnal/cgo-bench/main_test.go
  main_test.go:19	0x4f8480		31c9			XORL CX, CX                          // xor %ecx,%ecx
  main_test.go:19	0x4f8482		ba00efcdab		MOVL $-0x54321100, DX                // mov $-0x54321100,%edx
  main_test.go:22	0x4f8487		eb0b			JMP 0x4f8494                         // jmp 0x4f8494
  main_test.go:22	0x4f8489		48ffc1			INCQ CX                              // inc %rcx
  main_test.go:23	0x4f848c		90				NOPL                                 // nop
  main_test.go:6	0x4f848d		4881c278563412	ADDQ $0x12345678, DX                 // add $0x12345678,%rdx
  main_test.go:22	0x4f8494		483988a0010000	CMPQ CX, 0x1a0(AX)                   // cmp %rcx,0x1a0(%rax)
  main_test.go:22	0x4f849b		7fec			JG 0x4f8489                          // jg 0x4f8489
  main_test.go:25	0x4f849d		48891524521400	MOVQ DX, cgo-bench.result(SB)        // mov %rdx,0x145224(%rip)
  main_test.go:26	0x4f84a4		c3				RET                                  // retq

The ADDQ instruction is now present in the generated assembly, so the code does add the 0x12345678 constant to a register in a loop. However, there is still no CALL instruction to be seen. This is because the compiler performed another optimization: it inlined the function call. Inlining is an optimization where the compiler replaces a function call with the body of the function itself. This removes the overhead of calling the function, so the CPU can spend all of its time actually running the code inside the function.

This optimization can be disabled with the //go:noinline directive, a magic comment that instructs the compiler to never inline the function:

//go:noinline
func add(x, y uint64) uint64 {
	return x + y
}

And now the compiler doesn’t inline the function anymore:

❯ go test -c .
❯ go tool objdump -gnu -s BenchmarkAddPureGo cgo-bench.test
TEXT cgo-bench.BenchmarkAddPureGo(SB) /home/aurelien/personnal/cgo-bench/main_test.go
  main_test.go:20	0x4f84a0		493b6610		CMPQ 0x10(R14), SP                   // cmp 0x10(%r14),%rsp
  main_test.go:20	0x4f84a4		7658			JBE 0x4f84fe                         // jbe 0x4f84fe
  main_test.go:20	0x4f84a6		4883ec20		SUBQ $0x20, SP                       // sub $0x20,%rsp
  main_test.go:20	0x4f84aa		48896c2418		MOVQ BP, 0x18(SP)                    // mov %rbp,0x18(%rsp)
  main_test.go:20	0x4f84af		488d6c2418		LEAQ 0x18(SP), BP                    // lea 0x18(%rsp),%rbp
  main_test.go:20	0x4f84b4		4889442428		MOVQ AX, 0x28(SP)                    // mov %rax,0x28(%rsp)
  main_test.go:20	0x4f84b9		31c9			XORL CX, CX                          // xor %ecx,%ecx
  main_test.go:20	0x4f84bb		ba00efcdab		MOVL $-0x54321100, DX                // mov $-0x54321100,%edx
  main_test.go:23	0x4f84c0		eb22			JMP 0x4f84e4                         // jmp 0x4f84e4
  main_test.go:23	0x4f84c2		48894c2410		MOVQ CX, 0x10(SP)                    // mov %rcx,0x10(%rsp)
  main_test.go:24	0x4f84c7		b878563412		MOVL $0x12345678, AX                 // mov $0x12345678,%eax
  main_test.go:24	0x4f84cc		4889d3			MOVQ DX, BX                          // mov %rdx,%rbx
  main_test.go:24	0x4f84cf		e8acffffff		CALL cgo-bench.add(SB)               // callq 0x4f8480
  main_test.go:23	0x4f84d4		488b4c2410		MOVQ 0x10(SP), CX                    // mov 0x10(%rsp),%rcx
  main_test.go:23	0x4f84d9		48ffc1			INCQ CX                              // inc %rcx
  main_test.go:26	0x4f84dc		4889c2			MOVQ AX, DX                          // mov %rax,%rdx
  main_test.go:23	0x4f84df		488b442428		MOVQ 0x28(SP), AX                    // mov 0x28(%rsp),%rax
  main_test.go:23	0x4f84e4		483988a0010000	CMPQ CX, 0x1a0(AX)                   // cmp %rcx,0x1a0(%rax)
  main_test.go:23	0x4f84eb		7fd5			JG 0x4f84c2                          // jg 0x4f84c2
  main_test.go:26	0x4f84ed		488915d4511400	MOVQ DX, cgo-bench.result(SB)        // mov %rdx,0x1451d4(%rip)
  main_test.go:27	0x4f84f4		488b6c2418		MOVQ 0x18(SP), BP                    // mov 0x18(%rsp),%rbp
  main_test.go:27	0x4f84f9		4883c420		ADDQ $0x20, SP                       // add $0x20,%rsp
  main_test.go:27	0x4f84fd		c3				RET                                  // retq
  main_test.go:20	0x4f84fe		4889442408		MOVQ AX, 0x8(SP)                     // mov %rax,0x8(%rsp)
  main_test.go:20	0x4f8503		e818cff6ff		CALL runtime.morestack_noctxt.abi0(SB) // callq 0x465420
  main_test.go:20	0x4f8508		488b442408		MOVQ 0x8(SP), AX                     // mov 0x8(%rsp),%rax
  main_test.go:20	0x4f850d		eb91			JMP cgo-bench.BenchmarkAddPureGo(SB) // jmp 0x4f84a0

❯ go tool objdump -gnu -s cgo-bench.add  cgo-bench.test
TEXT cgo-bench.add(SB) /home/aurelien/personnal/cgo-bench/main_test.go
  main_test.go:7	0x4f8480		4801d8			ADDQ BX, AX                          // add %rbx,%rax
  main_test.go:7	0x4f8483		c3				RET                                  // retq

The CALL to cgo-bench.add is indeed there. There's also a bit of boilerplate around it to abide by the Go calling convention and place the function parameters in the correct registers, %rax and %rbx.

Now the benchmark should give a better result for the cost of a function call:

❯ go test . -bench=BenchmarkAddPureGo
goos: linux
goarch: amd64
pkg: cgo-bench
cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
BenchmarkAddPureGo-8	1000000000			0.9436 ns/op
PASS
ok		cgo-bench	1.052s

The CGO version

Let's write a CGO version of our add function that actually calls an add function defined in C. It's possible to write C code directly in a Go file, in a comment just above the import "C" statement.

package cgo_bench

/*
   #include <stdint.h>

   uint64_t add(uint64_t a, uint64_t b) {
	  return a + b;
   }
*/
import "C"

func addcgo(x, y uint64) uint64 {
	x_c := C.uint64_t(x)
	y_c := C.uint64_t(y)
	sum, err := C.add(x_c, y_c)
	if err != nil {
		panic("failed to call C add implementation")
	}
	return uint64(sum)
}

Note that this code needs to live in a separate file, as test files can't import "C" due to a limitation of the Go toolchain.

Here is the associated benchmark:

func BenchmarkAddCGo(b *testing.B) {
	x := uint64(0x12345678)
	y := uint64(0xABCDEF00)
	for i := 0; i < b.N; i++ {
		y = addcgo(x, y)
	}
	result = y
}

And the result:

❯ go test . -bench=BenchmarkAddCGo
goos: linux
goarch: amd64
pkg: cgo-bench
cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
BenchmarkAddCGo-8		29442754		   40.70 ns/op
PASS
ok		cgo-bench	1.205s

The CGO version is indeed slower: each call to addcgo takes around 40 nanoseconds, roughly 43 times the cost of the pure Go call.

Conclusion

Utilizing the Go FFI to call a C function incurs an overhead of approximately 40 nanoseconds. For exceptionally simple functions that execute within a few nanoseconds, this overhead can significantly impede performance. Nevertheless, in most practical scenarios, this additional cost is likely to be trivial. For instance, if the C function requires 1 µs to execute, the extra 40 ns due to CGO represents a mere 4% overhead. When the C function's execution time extends to 1 ms, the CGO overhead diminishes to only 0.004%.

CGO usage presents another drawback: the Go compiler cannot inline C functions, which can prevent some compiler optimizations. But again, if the C functions are substantial enough, this limitation should have minimal impact on performance.

Consequently, incorporating CGO into your next project is unlikely to create significant performance issues.

A final note: the benchmark only measures the overhead of calling CGO within a single thread. It appears that using CGO on large multicore servers may introduce small contention issues, as discussed in more detail here: https://shane.ai/posts/cgo-performance-in-go1.21/