Broken Drum
Published on September 30, 2023 by Patrick Dersjant

As promised on Mastodon, this is an attempt to reconstruct The Day I Broke The Drum.
By nature, it's a technical story. Yes, I'm going to try to be as open as I can about what went wrong. If you read this online, you'll notice bold italic text which should be readable by everybody; skip the normal text, which contains too many technical details (TMTD), if you feel like it. Don't be afraid to ask me on Mastodon if you're missing info.

Preamble: How the Drum is set up

The Mended Drum runs on a virtual server hosted in Germany. It consists of a server, a virtual hard disk, a Linux operating system tailored by YunoHost, and the Mastodon software.
When starting the Drum a year ago I decided to go with a Hetzner VPS for hosting. The advantage is that these can be scaled, allowing for growth. The disadvantage of these VPSes is that the disk space is extremely limited: 40 GB when we started, now 80 GB as we've grown. That's not nearly enough for hosting all the Mastodon media files: with a cleaning job in place that keeps a 7-day cache, we're between 100 and 120 GB of usage now. (Older media files will be refetched from their originating server; local files won't be cleaned.) I decided against using S3 buckets for file storage: one, because I don't fully understand how they work and I want it to be simple; two, most buckets are outside the EU and I want to be GDPR-compliant; and three, it's not cheaper, so there is no cost argument. That left me with Hetzner Volumes for storage. These can also be resized (handy!), and are mirrored by Hetzner so hardware problems won't affect us. The Drum has had a 200 GB volume for some time now.
The VPS is backed up daily via Hetzner's system (meaning a full server image is stored); seven backups are kept. The backup does not contain the volumes though, only the VPS itself!
On the software side, I chose to use YunoHost. Built on top of Debian, it is a distribution meant to easily set up and maintain a reasonably secured internet server hosting one or more pre-packaged applications. The applications can be installed and upgraded via scripts that fetch the relevant sources from the respective repositories. These scripts are updated by the community, but sometimes it takes a bit longer before new versions are tested and available for install/upgrade. Also, the scripting is quite rigid: it provides an application backup, but you can't tell it to omit cache files. The automated backup would easily fill 140 GB, which our little server doesn't have.
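For the technically inclined: the idea of "a backup that skips the cache" can be sketched in a few lines of Python. This is not how YunoHost's backup works and not something that ran on the Drum; it only illustrates the exclusion. The function name make_backup and the folder name "cache" are assumptions made for this example.

```python
import tarfile
from pathlib import PurePosixPath


def make_backup(source_dir, archive_path, excluded_dirs):
    """Write a gzip-compressed tar of source_dir, skipping excluded_dirs.

    excluded_dirs holds paths relative to source_dir, for example {"cache"}.
    The folder names are hypothetical; adjust them to the real layout.
    """
    excluded = [PurePosixPath(d) for d in excluded_dirs]

    def skip_excluded(member):
        # Member names look like "./cache/file"; PurePosixPath drops the "./".
        path = PurePosixPath(member.name)
        for ex in excluded:
            if path == ex or ex in path.parents:
                # Returning None leaves this entry out of the archive;
                # for a directory, its contents are skipped as well.
                return None
        return member

    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(source_dir, arcname=".", filter=skip_excluded)
```

Called as make_backup("/path/to/media", "/path/to/backup.tar.gz", {"cache"}), this would archive everything except the refetchable cache. Both paths are placeholders, and the archive should be written outside the folder being backed up.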
 

Act I: Security update

I had an extended weekend away until September 18th. During this weekend, the people providing the Mastodon software announced an urgent security patch. Without this patch, the software could no longer be considered secure, so the announcement contained the advice to update as quickly as possible.
The urgency with which this was announced was accompanied by a lack of detail: security advisories (or CVEs) were missing. To this date (29/09), one of the two that led to this fix is still not very clear. See CVE-2023-42452 and CVE-2023-42451 for details. As it stands now, the risks are client-side rather than server-side, but at the time of the Breakage I didn't realize that.
Lesson learned: Security updates are important. Thinking before installing them, and being careful, is just as important.

Act II: Breakage

As there was no YunoHost-approved version of the security patch available, I created my own version. However, this version did not install properly as it failed a consistency check, leaving the system broken.
The YunoHost community hadn't released 4.1.8 yet, so I created my own forked version. That is relatively easy and I had successfully done so with earlier versions already. To verify the source of the update, a checksum needs to be computed and put in the script, to make sure the downloaded sources are OK. Here it went wrong: without enough coffee and in the early morning I used md5sum instead of sha256sum to compute the checksum and put the wrong value in the script. This obviously meant that when installing, the checksum was incorrect and the upgrade aborted. The system was left in a failed state.
The Mastodon install had been removed, but the database was still available, as was the file cache (in a temporary folder called .sys.tmp). These files would have been copied back after a successful upgrade.
Lesson learned: Coffee first, update later. Always - even when doing something you have done before - write out your plan. Test. Test. Test.
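For the technically inclined: the md5sum/sha256sum mix-up is the kind of mistake a small pre-flight check can catch before an upgrade starts. The sketch below is in Python and is only an illustration, not part of YunoHost's tooling; the function names are made up. It relies on one plain fact: an MD5 digest is 32 hexadecimal characters, a SHA-256 digest is 64.

```python
import hashlib
import re

# A SHA-256 digest written in hex is exactly 64 characters (MD5 is 32).
SHA256_HEX = re.compile(r"[0-9a-f]{64}")


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def looks_like_sha256(value):
    """True if value has the shape of a SHA-256 hex digest."""
    return SHA256_HEX.fullmatch(value.strip().lower()) is not None


def check_declared_checksum(path, declared):
    """Compare a declared checksum with the real SHA-256 of the file.

    Raises ValueError when the declared value cannot be a SHA-256 digest,
    which is what pasting md5sum output by mistake would look like.
    Returns True on a match and False on a mismatch.
    """
    if not looks_like_sha256(declared):
        raise ValueError(
            f"declared checksum has {len(declared.strip())} characters, "
            "expected 64 for SHA-256 (32 would suggest MD5)"
        )
    return sha256_of_file(path) == declared.strip().lower()
```

Running check_declared_checksum on the downloaded source archive and the value about to go into the script, before starting the upgrade, would turn this particular mistake into an error message instead of a half-removed install.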
  

Act III: Restore! Restore!

When trying to restore a working system, things went wrong because I felt under pressure to be online again as soon as possible. Only on the third attempt did I succeed in getting a reasonably working copy back online.

With the system in a failed state, my first instinct was to save what I had. I had made a database backup already, so the accounts and statuses were safe, as were the important configuration files. I also still had the files in .sys.tmp, containing the media like avatars, headers and images. I copied these over to a new folder, but ran out of disk space about halfway. With the volume full, I created a new one to make a new backup and copied over the system folder. In hindsight, somewhere here not everything was copied over, leading to about half the denizens of the Drum losing their avatars, as well as other media files.
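For the technically inclined: both problems in this step (running out of space halfway, and not noticing that the copy was incomplete) can be checked for with a few lines of Python. This is a sketch written after the fact, not something that was run during the incident; the function names and the 10% safety margin are assumptions.

```python
import os
import shutil


def tree_size(root):
    """Return (file_count, total_bytes) for all regular files under root."""
    count = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.isfile(full) and not os.path.islink(full):
                count += 1
                total += os.path.getsize(full)
    return count, total


def has_room_for(source, destination_parent, margin=0.10):
    """True if destination_parent has free space for source plus a margin."""
    _count, needed = tree_size(source)
    free = shutil.disk_usage(destination_parent).free
    return free >= needed * (1 + margin)


def missing_files(source, destination):
    """Return relative paths that exist under source but are absent
    from destination or have a different size there."""
    missing = []
    for dirpath, _dirnames, filenames in os.walk(source):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, source)
            dst = os.path.join(destination, rel)
            if not os.path.isfile(dst) or os.path.getsize(dst) != os.path.getsize(src):
                missing.append(rel)
    return sorted(missing)
```

The intended order is: call has_room_for before the copy starts, and call missing_files once it has finished; an empty list means every file arrived with the same size. Comparing sizes is a quick check, not proof that the contents are identical.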

With the backup now on a volume mounted safely (read-only) somewhere else, I started on reinstalling Mastodon via YunoHost. However, the reinstall didn't work and threw up all kinds of error messages. I did not keep a full log of this, so I cannot fully comprehend what went wrong.
As another option, I could create a new virtual machine from one of the system images Hetzner made. So that was what I did; that server came up (with a different IP address, so DNS changes were necessary in between) and I could mount a copy of the backed-up volume (yes, we're up to three 200 GB volumes now and two VPSes). The Drum was back up!
But the fun didn't last long: soon it became apparent the backup I restored was six days old (instead of the newest one, which is the last one in the list, I selected the first one in the list). I did have a running drum though, so better to keep that one up for a while.
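For the technically inclined: picking a backup by its position in a list is exactly what went wrong here, so the safer habit is to pick by creation time and to look at the age before restoring. A minimal Python sketch follows; the dictionary keys "id" and "created" and all the sample values are invented for this example and are not Hetzner's actual data format.

```python
from datetime import datetime, timezone


def newest_backup(backups):
    """Return the backup with the latest 'created' timestamp.

    backups is a list of dicts with the hypothetical keys 'id' and
    'created' (an ISO 8601 string). List order is deliberately ignored.
    """
    if not backups:
        raise ValueError("no backups to choose from")
    return max(backups, key=lambda b: datetime.fromisoformat(b["created"]))


def age_in_days(backup, now):
    """Age of a backup in whole days relative to 'now'."""
    created = datetime.fromisoformat(backup["created"])
    return (now - created).days


# Made-up sample data: oldest first, the way the list was shown to me.
sample = [
    {"id": "backup-1", "created": "2023-09-12T02:00:00+00:00"},
    {"id": "backup-2", "created": "2023-09-15T02:00:00+00:00"},
    {"id": "backup-3", "created": "2023-09-18T02:00:00+00:00"},
]
chosen = newest_backup(sample)
now = datetime(2023, 9, 18, 12, 0, tzinfo=timezone.utc)
print(chosen["id"], "is", age_in_days(chosen, now), "days old")
```

Printing the age before restoring is the useful part: a "6" on the screen would have been hard to miss.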
I decided to build a completely new VPS with the newest YunoHost and a fresh Mastodon install. When that had completed, I loaded the database backup into that server - which worked on the second try (Guys, when the manual states you have to *drop* the old database first, they mean it). With another volume copy mounted, this time we were up for real and the switchover could happen - necessitating another DNS change. Oh, and due to HSTS, testing this before going live is a real bummer, as I could only issue the Let's Encrypt HTTPS certificate after the DNS change.

Lesson learned: Have a good recovery plan. Think it through beforehand. Test restores.
 

Act IV: Aftermath

With the Drum back up, I really appreciated your kind words. Some problems persisted: logging in with two-factor authentication was solved a couple of days later; the missing local avatars unfortunately needed user action, but all users have been DM'ed. Missing remote data (avatars, images) should be solved when the 4.2.0 update is installed.
Another thing that became apparent is that it's important to have somebody to talk to - a one-person admin team is a no-person admin team. Luckily, Iain (@bigcalm) volunteered immediately and has been installed as a second admin. With three moderators (@happydisciple, @djdarren and @tho99) we're up to a five-person team now.

Lesson learned: Have more than one admin.

Thank you for reading so far and for your patience.

Patrick