Understanding Go panic output

Update November 2023: Go 1.17 and 1.18 improved the panic output from my original post. See the update below for details.

My code has a bug. 😭

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x751ba4]
goroutine 58 [running]:
github.com/joeshaw/example.UpdateResponse(0xad3c60, 0xc420257300, 0xc4201f4200, 0x16, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/joeshaw/example/resp.go:108 +0x144
github.com/joeshaw/example.PrefetchLoop(0xacfd60, 0xc420395480, 0x13a52453c000, 0xad3c60, 0xc420257300)
        /go/src/github.com/joeshaw/example/resp.go:82 +0xc00
created by main.runServer
        /go/src/github.com/joeshaw/example/cmd/server/server.go:100 +0x7e0

This panic is caused by dereferencing a nil pointer, as indicated by the first line of the output. These types of errors are much less common in Go than in other languages like C or Java thanks to Go’s idioms around error handling.

If a function could fail, the function must return an error as its last return value. The caller should immediately check for errors from that function.

// val is a pointer, err is an error interface value
val, err := somethingThatCouldFail()
if err != nil {
    // Deal with the error, probably pushing it up the call stack
    return err
}

// By convention, nearly all the time, val is guaranteed to not be
// nil here.

However, there must be a bug somewhere that is violating this implicit API contract.

Before I go any further, a caveat: this is architecture- and operating system-dependent stuff, and I am only running this on amd64 Linux and macOS systems. Other systems can and will do things differently.

Line two of the panic output gives information about the UNIX signal that triggered the panic:

[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x751ba4]

A segmentation fault (SIGSEGV) occurred because of the nil pointer dereference. The code field maps to the UNIX siginfo.si_code field, and a value of 0x1 is SEGV_MAPERR (“address not mapped to object”) in Linux’s siginfo.h file.

addr maps to siginfo.si_addr and is 0x30, which isn’t a valid memory address.

pc is the program counter, and we could use it to figure out where the program crashed, but we conveniently don’t need to because a goroutine trace follows.

goroutine 58 [running]:
github.com/joeshaw/example.UpdateResponse(0xad3c60, 0xc420257300, 0xc4201f4200, 0x16, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/joeshaw/example/resp.go:108 +0x144
github.com/joeshaw/example.PrefetchLoop(0xacfd60, 0xc420395480, 0x13a52453c000, 0xad3c60, 0xc420257300)
        /go/src/github.com/joeshaw/example/resp.go:82 +0xc00
created by main.runServer
        /go/src/github.com/joeshaw/example/cmd/server/server.go:100 +0x7e0

The deepest stack frame, the one where the panic happened, is listed first. In this case, resp.go line 108.

The thing that catches my eye in this goroutine backtrace are the arguments to the UpdateResponse and PrefetchLoop functions, because the number doesn’t match up to the function signatures.

func UpdateResponse(c Client, id string, version int, resp *Response, data []byte) error
func PrefetchLoop(ctx context.Context, interval time.Duration, c Client)

UpdateResponse takes 5 arguments, but the panic shows that it takes more than 10. PrefetchLoop takes 3, but the panic shows 5. What’s going on?

To understand the argument values, we have to understand a little bit about the data structures underlying Go types. Russ Cox has two great blog posts on this, one on basic types, structs and pointers, strings, and slices and another on interfaces which describe how these are laid out in memory. Both posts are essential reading for Go programmers, but to summarize:

Strings are two words (a pointer to string data and a length)
Slices are three words (a pointer to a backing array, a length, and a capacity)
Interfaces are two words (a pointer to the type and a pointer to the value)

When a panic happens, the arguments we see in the output include the “exploded” values of strings, slices, and interfaces. In addition, the return values of a function are added onto the end of the argument list.

To go back to our UpdateResponse function, the Client type is an interface, which is 2 values. id is a string, which is 2 values (4 total). version is an int, 1 value (5). resp is a pointer, 1 value (6). data is a slice, 3 values (9). The error return value is an interface, so add 2 more for a total of 11. The panic output limits the number to 10, so the last value is truncated from the output.

Here is an annotated UpdateResponse stack frame:

github.com/joeshaw/example.UpdateResponse(
    0xad3c60,      // c Client interface, type pointer
    0xc420257300,  // c Client interface, value pointer
    0xc4201f4200,  // id string, data pointer
    0x16,          // id string, length (0x16 = 22)
    0x1,           // version int (1)
    0x0,           // resp pointer (nil!)
    0x0,           // data slice, backing array pointer (nil)
    0x0,           // data slice, length (0)
    0x0,           // data slice, capacity (0)
    0x0,           // error interface (return value), type pointer
    ...            // truncated; would have been error interface value pointer
)

This helps confirm what the source suggested, which is that resp was nil and being dereferenced.

Moving up one stack frame to PrefetchLoop: ctx context.Context is an interface value, interval is a time.Duration (which is just an int64), and Client again is an interface.

PrefetchLoop annotated:

github.com/joeshaw/example.PrefetchLoop(
    0xacfd60,       // ctx context.Context interface, type pointer
    0xc420395480,   // ctx context.Context interface, value pointer
    0x13a52453c000, // interval time.Duration (6h0m)
    0xad3c60,       // c Client interface, type pointer
    0xc420257300,   // c Client interface, value pointer
)

As I mentioned earlier, it should not have been possible for resp to be nil, because that should only happen when the returned error is not nil. The culprit was in code which was erroneously using the github.com/pkg/errors Wrapf() function instead of Errorf().

// Function returns (*Response, []byte, error)

if resp.StatusCode != http.StatusOK {
    return nil, nil, errors.Wrapf(err, "got status code %d fetching response %s", resp.StatusCode, url)
}

Wrapf() returns nil if the error passed into it is nil. This function erroneously returned nil, nil, nil when the HTTP status code was not http.StatusOK, because a non-200 status code is not an error and thus err was nil. Replacing the errors.Wrapf() call with errors.Errorf() fixed the bug.

Understanding and contextualizing panic output can make tracking down errors much easier! Hopefully this information will come in handy for you in the future.

Thanks to Peter Teichman, Damian Gryski, and Travis Bischel who all helped me decode the panic output argument lists.

Update

From the Go 1.17 release notes:

The format of stack traces from the runtime (printed when an uncaught panic occurs, or when runtime.Stack is called) is improved. Previously, the function arguments were printed as hexadecimal words based on the memory layout. Now each argument in the source code is printed separately, separated by commas. Aggregate-typed (struct, array, string, slice, interface, and complex) arguments are delimited by curly braces. A caveat is that the value of an argument that only lives in a register and is not stored to memory may be inaccurate. Function return values (which were usually inaccurate) are no longer printed.

And from the 1.18 release notes:

Go 1.17 generally improved the formatting of arguments in stack traces, but could print inaccurate values for arguments passed in registers. This is improved in Go 1.18 by printing a question mark (?) after each value that may be inaccurate.

A colleague recently had a crash similar to our example above. The relevant methods looked like this:

func (s *Service) GetCount(repo string) (count int64, errors []error)
func (s *Service) request(method string, url string, body []byte) (status int, response []byte, errors []error)

where s.GetCount(...) calls s.request(...).

The stack trace looked like this:

github.com/example/service.(*Service).request(0x0, {0x118368d?, 0xc000cd9b20?}, {0xc000588180?, 0x1?}, {0x0, 0x0, 0x0})
	/go/src/github.com/example/service/service.go:38 +0xdc
github.com/example/service.(*Service).GetCount(0xc000896700?, {0xc00084bed0?, 0x1ba03c0?})
	/go/src/github.com/example/service/service.go:69 +0xdc

You can see right away that the new output is a big improvement. The aggregated types (strings and slices in this example) are grouped together. Return values are omitted entirely.

Here it is again with my annotations:

github.com/example/service.(*Service).request(
    0x0,                          // *Service receiver, nil pointer (!)
    {0x118368d?, 0xc000cd9b20?},  // method string: pointer and length
    {0xc000588180?, 0x1?},        // url string: pointer and length
    {0x0, 0x0, 0x0}               // body []byte: pointer, length, capacity
)
	/go/src/github.com/example/service/service.go:38 +0xdc

github.com/example/service.(*Service).GetCount(
    0xc000896700?,                // *Service receiver, pointer
    {0xc00084bed0?, 0x1ba03c0?}   // repo string: pointer and length
)
	/go/src/github.com/example/service/service.go:69 +0xdc

Pretty clearly here you can see that the nil *Service receiver in the call to request is the problem. Something on line 38 is trying to dereference it and causing the crash.

But wait. GetCount calls request and its receiver is not nil (0x0). What’s going on?

The release notes above say that the stack trace could include “inaccurate values for arguments passed in registers” and signifies this by putting a question mark after such values.

GetCount does nothing with the receiver value other than passing it along to the request method. This means that when GetCount gets the receiver passed in as a register, it does not need to load it into RAM and we get the potentially inaccurate value in our stack trace.

request does do something with the value – dereferences it – requiring it to be loaded into RAM. That’s why the value is accurate in the request stack frame.