Cookie monster: chasing a Varnish vmod bug

  • Author Grzegorz Nosek
  • Published 29 Nov 2016
  • Category stories

"Guys, I can't upload this file via the admin panel". OK, so it's probably a body size limit somewhere. Clickety-click, nope. Oh well, let's debug this for real. Let's have a look at nginx's error log first:

2016/11/28 13:08:22 [error] 91451#91451: *67594161 recv() failed (104: Connection reset by peer)
while reading response header from upstream, client: (...), server: (...), request: "POST /index.php/(...)
HTTP/1.1", upstream: "http://127.0.0.1:6081/index.php/(...)", host: "(...)", referrer: "https://(...)"

I redacted it a bit to protect the innocent.

Not even a 50x

The upstream is a local Varnish cache server. This one looked ugly as we didn't get any response, just a closed socket. Varnish is not exactly known for extensive logging (unless you consider being drowned in varnishlog barf as logging) and in this case it was no different. A request comes in, no response comes out and that's it. Let's bring out heavier guns. Thankfully, the bug was easily reproducible (just try to upload the file again), so this time I was ready with sysdig.

# sysdig -s 1000 -w everything.scap
(clickety click)
# sysdig -vs 1000 -r everything.scap proc.name=varnishd | less

After some extra filtering I got this:

563074 13:54:31.723820614 7 varnishd (70917) < read res=32767 data=POST /index.php/(...)
(...)
563218 13:54:31.724385573 7 varnishd (70917) > rt_sigprocmask
563219 13:54:31.724387027 7 varnishd (70917) < rt_sigprocmask
563220 13:54:31.724389909 7 varnishd (70917) > tgkill pid=70784(varnishd) tid=70917(varnishd) sig=6(SIGABRT)
563221 13:54:31.724394794 7 varnishd (70917) < tgkill res=0
563222 13:54:31.724396661 7 varnishd (70917) > signaldeliver spid=0 dpid=70917(varnishd) sig=6(SIGABRT)

Boom. Death by SIGABRT, a modern form of Sudoku. The thread kills itself for some reason, probably it finds things are so broken internally that it cannot continue. I need a stack trace for this, so let's reproduce it in the staging environment. Actually, no, we won't as it works perfectly there. We should have swapped the machines (things always work perfectly in testing only to fail in production) and called it a day, but I ended up doing it the hard way.

The core of a cookie

I could not simply attach via gdb to a production server and avoid getting crucified, so the next plan involved a core dump. Which was of course disabled by resource limits:

# grep core /proc/PID/limits
Max core file size        0                    unlimited            bytes

Well... I could restart Varnish but that meant purging all this nice and warm cache. A quick google later, I found out about prlimit(1). A quick check later, I found out my distro is too old to have it (Ubuntu 14.04). Still, the syscall should be there, so...

# cat setprlimit.c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64

#include <sys/time.h>
#include <sys/resource.h>

#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv) {
    struct rlimit rlim_new = {
        RLIM_INFINITY,
        RLIM_INFINITY
    };

    int pid = atoi(argv[1]);
    if (prlimit(pid, RLIMIT_CORE, &rlim_new, NULL) < 0) {
        perror("prlimit");
        return 1;
    }
    return 0;
}

# make setprlimit
# ./setprlimit PID
# grep core /proc/PID/limits
Max core file size        unlimited            unlimited            bytes

Wooooohooo! Reproduce the bug again and marvel at the beauty of /var/crash/_usr_sbin_varnishd.107.crash. Ubuntu wraps core dumps in its apport magic but it can be disarmed easily enough:

# apport-unpack /var/crash/_usr_sbin_varnishd.107.crash ./varnish-crash
# cd varnish-crash
# gdb /usr/sbin/varnishd ./CoreDump

After installing varnish-dbg the core dump actually gave a useful stack trace. The interesting part was:

#0  0x00007f833686ac37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007f833686e028 in __GI_abort () at abort.c:89
#2  0x0000000000433d44 in pan_ic (func=0x474a3a <__func__.8200> "WS_Release", file=<optimized out>, line=<optimized out>, cond=<optimized out>,
    kind=<optimized out>) at cache/cache_panic.c:550
#3  0x000000000044b6a8 in WS_Release (ws=0x7f8323c86200, bytes=bytes@entry=40) at cache/cache_ws.c:231
#4  0x00007f83267f53fb in vmod_set (ctx=ctx@entry=0x7f832ac18190, priv=priv@entry=0x7f831dc9a2f8,
    name=name@entry=0x7f831cc0ec90 "\"\">(...)<p><p style",
    value=value@entry=0x7f831dc0f420 "\"\"text-align: justify") at vmod_cookie.c:171
#5  0x00007f83267f58e9 in vmod_parse (ctx=0x7f832ac18190, priv=0x7f831dc9a2f8, cookieheader=<optimized out>) at vmod_cookie.c:123

OK, a chunk of HTML from the uploaded file really shouldn't get parsed as a cookie. Whatever actually raised SIGABRT was only trying to limit damage (probably ran out of work memory trying to store a several-megabyte "cookie") and the bug was in the parsing code.

The Feynman Algorithm

So now all I had to do was to find and fix the bug in a single, reasonably self-contained function inside vmod_cookie. Let's use the Feynman Algorithm for this:

  1. Write down the problem.
  2. Think real hard.
  3. Write down the solution.

My attention was brought to this chunk:

if (sep != '\0')
    p = sep + 1;

See it in context on GitHub.

Hey, sep is a pointer! Whatever it points to, it definitely won't be equal to ASCII NUL (it could be a NULL pointer but there was a check for this just above). This check was clearly bogus and if sep happened to point to a NUL character, we'd skip over it and start parsing whatever came next in memory, like a several-megabyte HTML-in-CSV document made of pure hatred towards parsers.

The obvious fix (checking *sep instead of sep) didn't work (Varnish test suite caught some errors with it) but after a while I came up with a better patch that has just been merged.