
Reliability And Failure

The image above shows what our website looked like from about 1 PM on Wednesday until about 2 PM yesterday. Our email was also intermittently failing during the same period. There are huge differences between the issue of reliability in computer services and in structural engineering, but there are enough similarities to make some discussion worthwhile.

As a small company, we do not run our own server for internet use. We have been working with A2 Hosting for about six years and are generally very happy with their service. Before anyone gets upset – this is not a post about how we’re now not happy with them. The problem we had this week is not their fault and they responded quickly and efficiently when I notified them that there was a problem. A2 provides our internet domain, our email, and our website, which is built using a program called WordPress.

We’ve been receiving notices for a while that our service was going to be migrated to a new server. Usual computer stuff: new hardware, better, faster, more reliable, etc. That all seemed good. My experience with this kind of thing told me that we would have some outages during the actual changeover from one server to another, but that’s okay. If an email is delayed by an hour, nothing bad happens, as that’s the nature of asynchronous communication.

The email outages during the change were caused mostly by the change in the server name. We had to reconfigure our email client apps to use the new server name, which is a slightly finicky and annoying operation. We only really lost email for a short period of time. The Mail app on our computers was down for longer, but we could and did switch to using the web interface for mail during that time. So far, so good – a triumph of good planning.

When the server change was complete, around 5 PM on Wednesday, I got an email saying everything would be back to normal in an hour or two, after the DNS change (DNS being the method used to convert our web name “oldstructures.nyc” to the numeric IP address, up to 12 digits, that is our actual web address) made its way around the internet. That did not happen. Our site stubbornly remained, as seen above, a completely white screen. After some emails and a long phone call with a midwestern-polite (A2 is in Ann Arbor) and very helpful technician, the problem was solved.

WordPress is basically a fancy database containing text and pictures, put up as a web page in HTML according to rules defined by the reasonably-user-friendly interface for creating pages and blog posts. All of that data – tens of thousands of files – had been migrated from the old server to the new one. Except, as it turns out, for one file. The configuration file for the database, the thing that tells a request from someone’s web browser where to look for the data, was literally empty. That file is not very big, but it has a critical role and somehow it wasn’t copied properly during the migration. When the A2 tech re-copied it, our entire website came back from the white void immediately.
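
A file that arrives empty is one of the easier copy failures to look for, because it can be spotted without access to the original. The following is only a minimal sketch in Python of that kind of check, not what A2 actually ran. The function name find_empty_files and the folder it scans are made up for illustration. In a standard WordPress install the database settings usually live in a file called wp-config.php, but that the empty file here was that one is an assumption on my part.

```python
from pathlib import Path


def find_empty_files(root):
    """Return the relative paths of all zero-byte files under root.

    Hypothetical helper for illustration: a zero-byte file is not always
    an error, but an empty configuration file after a migration is a
    strong hint that the copy went wrong.
    """
    root = Path(root)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")          # walk every file and folder
        if path.is_file() and path.stat().st_size == 0
    )
```

Running find_empty_files on the migrated site folder would list every zero-byte file, which someone would then have to review by hand, since some files are legitimately empty.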

One of the core principles of engineering is that everything fails. In structural design we literally use failure as a starting point: the design of a beam or a column is based on the demand (the loads that will be placed on it) and the capacity (how much it can carry before failure). But neither the loads nor the capacity are known exactly. In simplistic terms, the loads vary and can be thought of as an asymmetrical bell curve with the design load as the peak, and the capacity varies and is a somewhat-less asymmetric bell curve with the design capacity as the peak. We keep the capacity curve higher than the demand curve, but the tails of the curves always overlap a bit. In short: in structural design, where failure has potentially terrible consequences, we can’t completely rule it out. There may be a situation where load is unexpectedly high and capacity is reduced.
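
To put a number on the overlapping tails, here is a rough sketch in Python. It makes simplifying assumptions that are not part of real design practice: load and capacity are treated as independent, symmetric normal distributions (the real curves are asymmetric, as noted above), and the means and standard deviations in the usage note are invented for illustration.

```python
import math


def failure_probability(mean_load, sd_load, mean_capacity, sd_capacity):
    """Probability that load exceeds capacity.

    Assumes load and capacity are independent and normally distributed,
    which is a simplification of the asymmetric curves used in practice.
    """
    # The safety margin (capacity minus load) is then also normal.
    margin_mean = mean_capacity - mean_load
    margin_sd = math.hypot(sd_load, sd_capacity)
    # Number of standard deviations between the mean margin and zero.
    beta = margin_mean / margin_sd
    # Standard normal CDF evaluated at -beta.
    return 0.5 * math.erfc(beta / math.sqrt(2))
```

With an invented mean load of 100 and mean capacity of 200 (standard deviations 15 and 20), the margin sits four standard deviations above zero and the result is a few chances in 100,000: small, but not zero. Moving the two curves closer together, or widening either one, makes the overlap and the failure probability grow.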

As anyone who has used a computer knows, copying files is an imperfect procedure. It’s pretty good, but once in a while something goes wrong. I have no idea why, since I don’t know enough about the way the hardware and software work, but I’ve experienced it enough over the last 40 years to know it is true. It so happens that the file that was improperly copied was a critical one for WordPress and so shut down the whole site. It’s entirely possible, verging on likely, that there were other less-important files that were improperly copied. I’ll never know until someone tells me that a blog post from 2014 isn’t displaying properly. I will argue that some failures of copying files were inevitable, and we’re fortunate that having our website down for a day was ultimately not harmful to anyone.
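
One common way to catch an improperly copied file is to compare a checksum of each original against its copy. The sketch below, in Python, assumes both the old and new folders are still readable from the same machine, which may not be true for a hosting migration. The function names sha256_of and find_copy_mismatches are hypothetical, and this is not a description of any tool A2 uses.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_copy_mismatches(source_dir, dest_dir):
    """List files under source_dir that are missing or different in dest_dir."""
    source_dir, dest_dir = Path(source_dir), Path(dest_dir)
    mismatches = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        dst = dest_dir / rel
        # A missing copy and a copy with different contents both count.
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            mismatches.append(rel.as_posix())
    return mismatches
```

A check like this reports that a copy differs, not why it differs, and it only covers the files it is pointed at. It reduces the chance of an undetected bad copy rather than eliminating it.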
