Restory – a tale of disaster recovery

  • Author Grzegorz Nosek
  • Published 15 Oct 2015
  • Category stories

It all started with a storage failure at our cloud service provider. JBODs-a-hanging, hypervisors-a-segfaulting, all in all, they had a rather fun day. The VMs finally started recovering but we couldn't log into one of them as sshd failed to start. I rebooted it with the VNC console attached and saw something like "ld.so inconsistency assertion failed" flashing by too quickly to remember exactly. Looks bad. Started googling and found an error message that looked rather similar to the one I saw on the console:

Inconsistency detected by ld.so: ../sysdeps/x86_64/dl-machine.h: 530: elf_machine_rela_relative: Assertion `((reloc->r_info) & 0xffffffff) == 8' failed!

Corrupted ELF (chaotic evil)

So. Yeah. All clear. Not really, but it looks like a corrupted binary on disk, since that's the only way the outage could explain a broken ELF relocation. Just my luck, it had to hit the ssh server of all things. Of course the suggested solution was to reboot the machine (apparently some laptops corrupt memory upon suspend/resume), which didn't quite work for me.

I rebooted the server once again, this time with init=/bin/bash (as single user mode as it gets but the initramfs sets up LVM and stuff), ran fsck which fixed some errors and rebooted again to see if it magically helped (spoiler alert: it didn't). So, another reboot into single user mode.

I started with remounting the filesystem read-write via mount / -o remount,rw and was just about to start hacking, but first I decided I might need a more functional shell. At the very least, Ctrl-C would be nice, but that doesn't really work with pid 1 being bash. So, let's start another shell:

bash

Nope, still no job control (and frankly I didn't check Ctrl-C there). Another idea:

screen

Yup, the shell inside has working Ctrl-C and stuff but it's a plain old sh even without tab completion, like an animal. bash. Fixed. And now I have multiple shells if I need them (Ctrl-A C).

I started by checking MD5 sums on the libc6 package (from /, since the paths in dpkg's md5sums files are relative to the root directory). I'm retyping all the commands from memory so there are probably errors, but that's the gist:

md5sum --quiet -c var/lib/dpkg/info/libc6:amd64.md5sums

No errors there. On second thought, if libc got corrupted, I probably wouldn't even get this far. Let's try the ssh server this time. I can never remember if it's ssh, sshd, openssh-server or whatever:

md5sum --quiet -c var/lib/dpkg/info/*ssh*.md5sums

No errors there either. So it must be a library that's used by the sshd binary. Not to waste time, on one terminal (screen already came in handy) I started checking all the packages:

md5sum --quiet -c var/lib/dpkg/info/*.md5sums | less

and on another I checked the dependencies of sshd:

ldd /usr/sbin/sshd

I wanted to verify the MD5s of the packages containing the required files, so I started to write a monstrosity that would do this in one shot. I got up to

ldd /usr/sbin/sshd | awk '{ print $3 }' | grep / | xargs dpkg -S | cut -d: -f1 | sort -u | xargs grep -f ...

Err... what do I grep for? I tried to ls /var/lib/dpkg/info/*.md5sums | this_bigass_grep but on reflection I should have generated globs instead. Thankfully, the md5sum on the other terminal finished sooner than I expected.
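
The glob idea could look roughly like the filter below. It is a sketch, not what was run at the time: md5sums_for_packages is a hypothetical helper name, and it assumes dpkg's usual layout, where a package's checksums live in either <pkg>.md5sums or <pkg>:<arch>.md5sums under /var/lib/dpkg/info.

```shell
# Hypothetical helper: read package names on stdin (one per line) and print
# the md5sums files dpkg keeps for them. The optional argument overrides the
# info directory (assumed default: /var/lib/dpkg/info).
md5sums_for_packages() {
    info_dir=${1:-/var/lib/dpkg/info}
    while read -r pkg; do
        # Skip empty lines, otherwise the glob below would match everything.
        [ -n "$pkg" ] || continue
        # Try both "pkg.md5sums" and the multiarch form "pkg:arch.md5sums".
        for f in "$info_dir/$pkg.md5sums" "$info_dir/$pkg":*.md5sums; do
            if [ -e "$f" ]; then
                printf '%s\n' "$f"
            fi
        done
    done
}

# Intended use, run from / because the paths inside md5sums files are relative:
#   ldd /usr/sbin/sshd | awk '{ print $3 }' | grep / | xargs dpkg -S \
#     | cut -d: -f1 | sort -u | md5sums_for_packages | xargs md5sum --quiet -c
```

The existence check matters because an unmatched glob is passed through literally by the shell, and md5sum would then complain about a file that was never there.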

Kind of like chmod -x chmod

Among other irrelevant (at this stage) changes, I found that libcrypto.so.1.0.0 doesn't match the packaged version. That would explain why sshd crashed, so now I just need to get a clean copy. I put one up on a (definitely plaintext) HTTP server nearby and tried to run wget to see if it worked.

# wget
Inconsistency detected blah blah

Yup, wget needs libcrypto too. Well, there's always netcat.

nc my.server.ip 80 > newlibcrypto.so
GET /libcrypto.so.1.0.0 HTTP/1.0

Waiting, waiting, waiting... hey, the binary isn't that large and it's a 10Gbps network! After a couple of seconds I got the shell back. Zero bytes. Remember the wisdom of the ancients: the network is faster when it's brought up.

ip link set dev eth0 up
ip a a dev eth0 x.x.x.x/xx
ip r a default via x.x.x.x
ping -c3 -w3 8.8.8.8

(ping with timeout as I'm still nervous about the Ctrl-C, even though I know it works). Okay, network is up, try again. This time I got the file but it's somewhat small. I look inside and see the HTTP response with headers and all, redirecting me to some weird server. Some leftovers from playing with web server configs? Let me pass a Host header:

nc my.server.ip 80 > newlibcrypto.so
GET /libcrypto.so.1.0.0 HTTP/1.0
Host: the-canonical-hostname.of.the.other.guy

Nope, no dice. Temporarily I switch to another file on the server that's small, plaintext and safe to dump on my terminal, and drop the shell redirection to a file:

nc my.server.ip 80
GET /pubkey.asc HTTP/1.0
Host: the-canonical-hostname.of.the.other.guy

Same 302. In a flash of brilliance I check the IP address. Sure enough, it had a typo (.25 instead of .125 or the other way round). Sorry, Mr Neighbour. One netcat later I had the file with HTTP headers attached in front of the binary. Thanks to HTTP/1.0 and no headers (Host wasn't necessary after all) at least I didn't have chunked encoding or compression to handle.
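
Typing the request by hand works, but it can also be piped in, which makes the blank line that ends the request explicit. A minimal sketch, assuming a netcat that keeps reading the response after its standard input hits end-of-file (some variants need an extra flag for that); http10_request is a made-up helper name:

```shell
# Hypothetical helper: print a bare HTTP/1.0 GET request for the given path.
# An optional second argument is sent as the Host header.
http10_request() {
    printf 'GET %s HTTP/1.0\r\n' "$1"
    if [ -n "${2:-}" ]; then
        printf 'Host: %s\r\n' "$2"
    fi
    # The empty line terminates the request headers.
    printf '\r\n'
}

# Intended use (address and file name as in the text):
#   http10_request /libcrypto.so.1.0.0 | nc my.server.ip 80 > newlibcrypto.so
```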

dd as a minimalistic HTTP parser

Now, to strip the HTTP headers from the file. I wasn't 100% sure vim would be binary safe so I decided against it. By playing with head -c I managed to find that the headers are about 230 bytes long, so I chomped off the 230 bytes:

dd if=newlibcrypto.so of=realnewlibcrypto.so bs=230 skip=1

No, this isn't the most efficient way to do this, but it did the job. Almost.

# file realnewlibcrypto.so
realnewlibcrypto.so: data

Oh, not good. Turns out checking for binary junk via terminal isn't that reliable either. One tool to remember is od (octal dump) which thankfully supports other output formats, including hex and printable ASCII:

head -c 10 /dev/urandom | od -t x1 -t c
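
The same od can hunt for the start of the binary directly: every ELF file begins with the four bytes 7f 45 4c 46 (0x7f followed by "ELF"). A rough sketch, with find_elf_offset being a hypothetical helper and 512 bytes an arbitrary search limit:

```shell
# Hypothetical helper: print the first byte offset (within the first LIMIT
# bytes of FILE) at which the ELF magic number 7f 45 4c 46 appears.
find_elf_offset() {
    file=$1
    limit=${2:-512}   # arbitrary default, enough for a short header block
    off=0
    while [ "$off" -lt "$limit" ]; do
        # Read 4 bytes at the current offset and render them as plain hex.
        magic=$(dd if="$file" bs=1 skip="$off" count=4 2>/dev/null \
                | od -An -t x1 | tr -d ' \n')
        if [ "$magic" = 7f454c46 ]; then
            echo "$off"
            return 0
        fi
        off=$((off + 1))
    done
    return 1
}

# Intended use: the printed number is the count of bytes to skip.
#   find_elf_offset newlibcrypto.so
```

It spawns one dd per offset, so it is slow, but for a few hundred bytes of headers that hardly matters.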

Eventually I went a bit further into the file to be 100% certain I'm past the end of the headers and started to back up byte by byte:

dd if=newlibcrypto.so bs=240 skip=1 | file -
dd if=newlibcrypto.so bs=239 skip=1 | file -
dd if=newlibcrypto.so bs=238 skip=1 | file -

At some point (232?) I finally got an ELF library, x86_64, not stripped or some such.

dd if=newlibcrypto.so bs=232 skip=1 of=libcrypto.so.1.0.0
sync
sync
sync  # kinda like Hail Mary. Or Beetlejuice.
reboot -f

Aaaaand we're back online. Phew.
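
With hindsight, the header length need not be guessed: HTTP headers end at the first empty line (a bare CR LF), so everything up to and including that line can be counted and skipped. A sketch under the assumption that the saved response does contain such a line; strip_http_headers is a hypothetical name:

```shell
# Hypothetical helper: strip_http_headers RESPONSE_FILE > BODY_FILE
# Writes everything after the first blank (CR LF only) line to stdout.
strip_http_headers() {
    cr=$(printf '\r')
    # sed prints lines up to and including the first one that holds only a
    # CR, then quits; wc counts those bytes, i.e. the size of the headers.
    hdr_len=$(LC_ALL=C sed "/^${cr}\$/q" "$1" | wc -c)
    # tail -c +N starts output at byte N (1-based), so skip hdr_len bytes.
    tail -c "+$((hdr_len + 1))" "$1"
}

# Intended use (file names as in the text):
#   strip_http_headers newlibcrypto.so > libcrypto.so.1.0.0
```

Because sed stops at the blank line, it only ever touches the textual headers and the binary body passes through tail untouched.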

L'esprit de l'escalier

Could I have done better? Sure. Instead of trying to get the file over HTTP, I could have served it raw, via netcat:

working-host$ netcat -l 10000 < libcrypto.so.1.0.0
recovery# netcat ip.of.working.host 10000 > libcrypto.so.1.0.0
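
Whichever way the file arrives, it is worth checking against the package's checksum list before rebooting. A sketch, assuming dpkg's manifest format of one "<md5>  <path relative to />" entry per line; verify_against_manifest is a hypothetical helper:

```shell
# Hypothetical helper: verify_against_manifest MANIFEST RELPATH [ROOT]
# Checks the single file RELPATH (relative to ROOT, default /) against its
# entry in a dpkg-style md5sums MANIFEST. Returns non-zero on a mismatch or
# if the manifest has no entry for RELPATH.
verify_against_manifest() {
    manifest=$1
    relpath=$2
    root=${3:-/}
    # An entry is 32 hex digits, two spaces, then the path (column 35 on).
    entry=$(awk -v p="$relpath" 'substr($0, 35) == p' "$manifest")
    [ -n "$entry" ] || return 1
    # Let md5sum re-check just that one entry, relative to ROOT.
    printf '%s\n' "$entry" | (cd "$root" && md5sum -c -)
}

# Intended use (manifest and path are left as placeholders on purpose):
#   verify_against_manifest /var/lib/dpkg/info/<package>.md5sums \
#       <path/to>/libcrypto.so.1.0.0
```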

Also, apparently, there's a way to debug relocations to easily find the offending library:

LD_DEBUG=reloc wget

Try LD_DEBUG=all for pages and pages for extra fun.
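
To cut that output down, one option is to keep only the last library the dynamic linker started relocating before it gave up. A sketch, assuming glibc's LD_DEBUG=reloc output marks each object with a "relocation processing:" line (worth confirming on the system at hand); last_relocation is a hypothetical helper:

```shell
# Hypothetical helper: read LD_DEBUG=reloc output on stdin and print the last
# "relocation processing:" line, i.e. the object relocated most recently.
last_relocation() {
    grep 'relocation processing:' | tail -n 1
}

# Intended use (LD_DEBUG writes to stderr, hence the redirections):
#   LD_DEBUG=reloc wget 2>&1 >/dev/null | last_relocation
```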

Hopefully I won't need to do it again, but as an experienced optimist, I'd better keep this in mind.