Vindication

Several weeks ago, one of the servers that I manage suffered a hard drive failure :(. The machine itself was, ahem, on the low-end of the server-capable scale for Java applications (P3 500MHz with 256MB RAM), and honestly, we were getting a great deal on it. Since the disk failed, and we needed to do some major re-building of the server, I figured we’d get some nicer hardware at the same time.

So, we got a P4 1.8GHz with a 1GB of RAM. All is well, right? Wrong.

Shortly after we got the new server, we started getting strange errors. Some of the Java applications were dying with mysterious signal 11s. When a Java VM dies, it sometimes doesn’t completely die, and you have to murder the process in order to re-start it.

Occationally, Java VMs are instable on certain machines, but those usually turn out to be AMD Athlon machines (and don’t get me wrong, I’d buy an AMD any day over an Intel CPU), and it usually turns out to be a glitch in the VM which gets fixed anyway. Regardless, I thought maybe the particular Java VM release I was using, but it appears to be relatively stable and in use for quite a while, even on Linux.

Then, other weird things started happening. The last few days of backups were all corrupt, and I kept getting this message from bzip2 when testing the archive:

bzip2/libbzip2: internal error number 1007.
This is a bug in bzip2/libbzip2, 1.0.2, 30-Dec-2001.
Please report it to me at: XXXXXXXXXX.  If this happened
when you were using some program which uses libbzip2 as a
omponent, you should also report this bug to the author(s)
of that program.  Please make an effort to report this bug;
timely and accurate bug reports eventually lead to higher
quality software.  Thanks.  Julian Seward, 30 December 2001.

*** A special note about internal error number 1007 ***

Experience suggests that a common cause of i.e. 1007
is unreliable memory or other hardware.  The 1007 assertion
just happens to cross-check the results of huge numbers of
memory reads/writes, and so acts (unintendedly) as a stress
test of your memory system.

Okay, my brand-new hardware sucks. Great. Oh, well, just call-up the hosting provider and get them to change-out the RAM, right? Wrong

Now, let me first say that I’m not going to mention the name of my hosting provider. That’s because they give us (a non-profit organization) a good deal on their services, respond quickly with the same primary contact person every time, and generally seem like they want to keep things moving smoothly for everyone. Unfortunately, this time, things did not go quite so smoothly.

First of all, I was assured that every Intel-compatable machine that comes from their hardware provider has its memory tested with memtest86, which I use myself. They argued that the memory was most likely not the problem because of the aforementined memory testing and the fact that the hardware was all new.

Now, I admit that I fall squarely on the software side of things, and I have the luxury of pretending that hardware works perfectly every time (“What do you mean division doesn’t work on some Pentium Processors?). However, I know when hardware is failing: when Java applications are dying, all my backups are corrupted due to a bug which bzip swears is more often bad memory, and MySQL starts spewing garbage strings when queried (oh, did I not mention that already?!).

Now, I admit there’s another possibility: the problem could be the motherboard or the CPU. Memory errors are not always the fault of the SIMM chips installed. Sometimes, the CPUs L1 or L2 caches can be faulty or the motherboard can have a glitch. Sometimes, the hardware is actually all okay, but the combination doesn’t work out so well (like when you have memory that’s a hair too slow for the mobo).

One of the ‘experts’ at my hosting company suggested that there might be a “hole in my filesystem”. WTF is that? I’ve never heard of anything like that, and neither has google. As far as I’m concerned, if google hasn’t heard of it, then it doesn’t exist ;)

Anyway, I was assured by my expert that, my moving my /usr/local data onto a fresh partition, my problems would magically go away. I simply needed to get the bad Java files off of the corrupted filesystem, and I’d be home free. By removing those files from the filesystem, I’d be essentially fixing my system and things would get better. They didn’t.

“Re-install Java”, was the next attempt. Riiiight. Re-installing Java is just deleting a directory and then untarring the archive. Re-installing Java ain’t gonna do anything. Oh, well. I’ll give it a try. Guess what? The system is so hosed, the Java installer won’t even run without seg faulting. Oops. I shouldn’t have deleted the Java installation. A system running a few Java apps with one crashing every few days is much better than a system with no Java applications running at all.

After about a dozen tries, I got Java installed. I rebooted (that tends to clear up a lot of things) and everything came back up fine.

At this point, I’m planning my strategy for the next mysterious segfault that I see: “Hey, if the hardware’s just fine, then you won’t mind giving me an identical rig and using my flawless hardware for your own servers”. I didn’t have to: my good friend, and usual point of contact at the hosting provider, calls me today to say that he’s going to be in the data center twice today, 5 hours apart. Sounds like a good time to just bite the bullet and go in there and run memtest86 on the rig and see what happens. 5 Hours outta be enough to check it out.

It turns out that he didn’t need that long: the box crapped itself as soon as the memory test started.

Within 48 hours, I’m assured that I’ll have a fresh set of SIMMs and everything should be okay after that. I’m hopeful, but not stupid. In fact, at this point, I’m actually borderline despondent. They’re going to switch out the chips, and the motherboard will be the problem. “But we gave you new RAM, it must be good!” will be the new mantra. Then, we’ll go through the charade again.

I just can’t catch a break.

We went through a bout of this at an old client (again, name witheld) of my former employer, The Adrenaline Group. The IT management decided that it was cheaper to have an army of AMD Athlons than a substantially lower number of Suns for web- and application-servers (Technically, yes, it is cheaper, but Suns handle high load much better).

Anyhow, they chose Penguin Computing as their hardware provider, and we ended up getting a bad batch of machines. We actually had six production servers, six QA servers, and six development servers. As I recall, at least two of each set were bad. We never saw any problems on the development machines because all of our load testing was done in QA, and things looked good. It turns out that the load tests sorely underestimated the load and that the real load caused the machines to work hard enough to crap themselves.

The software development management had us up for a week trying to figure out where we went wrong. They even called-in a pair of consultants from BEA (the app server we were using) to figure out if we had deployment issues, since this was their first BEA deployment. They were as baffled as we were. They were totally flailing.

I even went so far as to suggest that the problem was the hostname of the server, since the same ones went down every time (joking, of course). The manager looked at me with wide-open eyes and said “Really? Do you think it could be that?”. “No,” I replied. “It’s the hardware. Just go into the data center, take that server out of the rack, and throw it into the trash. Repeat if necessary. It’s bad hardware!”. They had the same reaction as this guy at my hosting provider: “It can’t be the hardware.”

I’m a software engineer. I know when it’s the hardware.

Comments are closed.