WildBill's Blogdom

Mongo only pawn, in game of life.

Since I Have Nothing Else to Do Right Now...


I’ll rant a little about high-tech stuff in general.

At work we’ve just finished investing in a lot of gear and software. Hardware-wise, we’ve gone with HP Blades - specifically, the BL25p, decked out in a dual-processor, dual-core 2.0GHz Opteron configuration, with 16GB of RAM on each blade, and 2x146GB local SCSI disks on each blade. Our blade chassis is fully loaded, which means eight of these suckers humming away. (And Gillette would agree, eight blades beats five any day.) Of course, we needed a lot of fast storage for these blades, so we went with an EMC CX300 disk array. We’re already running one CX300 and it’s been a champ for about a year and a half, so I figured we’d add to the already happy family.

As far as software goes, we’ve bought into the whole virtualization/consolidation paradigm in a pretty big way, so we purchased eight licenses of VMWare’s flagship product, ESX Server. The vision of taking all our dev/test servers and consolidating them all into eight blades + the EMC was … tempting. Plus, getting all our stuff on a SAN means that disaster recovery is as simple as replicating the SAN, right? RIGHT?

Well, this implementation hasn’t been the … smoothest I’ve ever been through, that’s for sure. We fired up the blades under Knoppix (thanks to Kyle for setting up a kickass PXE environment for all this), and ran a Linux-based stress test on them. All the blades passed with flying colors. We get the VMWare tech rep on site to train Kyle and Paul, while I get to be the martyr manager and not attend the class… and two of the blades fail during installation. They flat out lock up. Seeing as how that’s not a GoodThing(TM), we engaged HP.
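If you want to roll your own burn-in, something along these lines from a Knoppix shell will beat on the CPUs and RAM pretty hard. (memtester and stress are just the tools I’d reach for here, and the sizes are guesses for a 4-core, 16GB blade - adjust to taste.)

    # Load up all four cores and a good chunk of RAM for an hour.
    stress --cpu 4 --vm 4 --vm-bytes 2G --timeout 3600s

    # Then walk a big block of memory a few times, looking for bad sticks.
    memtester 2048M 3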

Long story short, Paul spent a lot of time working with HP, as well as testing various RAM combinations (there were enough weird symptoms pointing to the RAM as the culprit, or at least a contributor). After much wrangling, he finally got the issue under control and pinpointed two bad sticks of RAM - after going through and testing more than 80GB of RAM.

All’s well, and we proceed to actually deploy ESX and virtualize our existing servers. Things went really smoothly through this phase, mostly due to Kyle’s diligence and Knoppix-Fu (once again, that’s Knoppix Hacks, available at your local bookstore. While you’re at it, get his latest book too.)

So, anyway, we’ve got three of the blades running ESX, and the remaining five blades are set to temporarily be a testbed to validate our new datacenter deployment. OK, great, so I go and kickstart those blades, while Kyle’s virtualizing a Linux server on another blade, and Brian (one of our Windows admins) is P2V’ing a Windows server.

About a half hour later, ALL the VMs, on ALL the ESX servers, completely and utterly lock up. Totally. Investigation shows that the SAN has completely disappeared from underneath the ESX hosts. SAN H/W diagnostics show no issues with the SAN, yet none of the ESX servers can see the VMFS filesystem on the SAN.

I waste no time opening a priority-one case with VMWare. After about an hour of various diagnostics, we notice that the partition table on that LUN has disappeared. I manually recreate it with fdisk from the service console of one of the ESX hosts, run vmkfstools -V, and the data’s there again. We happily fire up all the VMs, and life continues, but my confidence in the product is shaken, to say the least.
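For posterity (and for the next poor soul who Googles this), the fix boiled down to something like the following from the service console. The device name and the single-partition layout are assumptions for illustration - check fdisk -l first and adjust for your own LUN.

    # /dev/sdb is a stand-in for however the service console sees the LUN.
    fdisk /dev/sdb
    #   n  -> new primary partition 1, spanning the whole LUN
    #   t  -> set the partition type to fb (VMFS)
    #   w  -> write the table and exit

    # Tell the VMkernel to rescan for VMFS volumes.
    vmkfstools -V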

Fast forward to Wednesday. I am virtualizing yet another server, the Windows guys are P2V’ing one too, and I realize that one of the five blade test hosts isn’t configured right, so Kyle kickstarts that sucker to get it dialed in. Guess what happens? Yep, boom - our SAN drops out again.

I was able to fix it in fairly short order (under two minutes) now that I knew what to do, but now I’m worried. We start stepping through our UNIX and Windows virtualization processes, and those seem fine. But… it strikes me that in both instances one of the other blades was kickstarting to Red Hat for that datacenter test. I ask Kyle to investigate our kickstart config, and he finds a line in there (originally written by the old admin) that essentially says, “Clear the partition tables of all disks I am attached to.” Since that blade was designed to be an ESX host and had visibility to the same LUN, it would nuke the partition table on the SAN when kickstarted. Having that line in a kickstart config makes sense if it’s a standalone server you’re installing, but when a SAN enters the mix, well, it’s not so good.
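I won’t paste our exact config, but the offending line amounted to the classic “wipe everything” kickstart directive, something like:

    # Wipe the partition table on every disk the installer can see --
    # which, on a blade zoned to the SAN, includes the shared VMFS LUN.
    clearpart --all --initlabel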

Needless to say, we have fixed that little issue by implementing LUN masking and by fixing the kickstart script. I’m just happy that the issue wasn’t a fatal bug in our hardware or in ESX server.
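The kickstart half of the fix is just a matter of scoping that directive to the blade’s local disks, along these lines (sda is an assumption - depending on the controller, your local disks may show up under a different name):

    # Only touch the local disks on the blade; leave the SAN LUNs alone.
    clearpart --drives=sda --all --initlabel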

Bottom line is that sometimes problems you face are caused by an external force rather than an internal issue. In this case, it was our kickstart script rather than a flaw in the hardware or software. We all may feel a little sheepish about this, but it could have been a LOT worse. Now that we have the RAM issue and this other ESX issue taken care of, I think (and hope!) we’re out of the woods and we can continue to deploy this as planned.