WildBill's Blogdom

Mongo only pawn, in game of life.

Since I Have Nothing Else to Do Right Now…

I’ll rant a little about high-tech stuff in general.

At work we’ve just finished investing in a lot of gear and software. Hardware-wise, we’ve gone with HP Blades - specifically, the BL25p, decked out in a dual processor, dual-core 2.0GHz Opteron configuration, with 16GB of RAM on each blade, and 2x146GB local SCSI disks on each blade. Our blade chassis is fully loaded, which means eight of these suckers humming away. (And Gillette would agree, eight blades beats five any day.) Of course, we needed a lot of fast storage for these blades, so we went with an EMC CX300 disk array. We’re already running one CX300 and it’s been a champ for about a year and a half, so I figured we’d add to the already happy family.

As far as software goes, we’ve bought into the whole virtualization/consolidation paradigm in a pretty big way, so we purchased eight licenses of VMware’s flagship product, ESX Server. The vision of taking all our dev/test servers and consolidating them onto eight blades + the EMC was … tempting. Plus, getting all our stuff on a SAN means that disaster recovery is as simple as replicating the SAN, right? RIGHT?

Well, this implementation hasn’t been the … smoothest I’ve ever been through, that’s for sure. We fired up the blades under Knoppix (thanks to Kyle for setting up a kickass PXE environment for all this), and ran a Linux-based stress test on them. All the blades passed with flying colors. We get the VMware tech rep on site to train Kyle and Paul, while I get to be the martyr manager and not attend the class… and two of the blades fail during installation. They flat out lock up. Seeing as how that’s not a GoodThing(TM), we engaged HP.

Long story short, Paul spent a lot of time working with HP, as well as testing various RAM combinations (there were enough weird symptoms pointing to RAM being the issue, or at least a contributor). After much wrangling, he finally got the issue under control and pinpointed two bad sticks of RAM - after going through and testing more than 80GB of RAM.

All’s well, and we proceed to actually deploying ESX and virtualizing our existing servers. Things went really smoothly through this phase, mostly due to Kyle’s diligence and Knoppix-Fu (once again, that’s Knoppix Hacks, available at your local bookstore. While you’re at it, get his latest book too.)

So, anyway, we’ve got three of the blades running ESX, and the remaining five blades are set to temporarily be a testbed to validate our new datacenter deployment. OK, great, so I go and kickstart those blades, while Kyle’s virtualizing a Linux server on another blade, and Brian (one of our Windows admins) is P2V’ing a Windows server.

About a half hour later, ALL the VMs, on ALL the ESX servers, completely and utterly lock up. Totally. Investigation shows that the SAN has completely disappeared from underneath the ESX hosts. SAN H/W diagnostics show no issues with the SAN, yet none of the ESX servers can see the VMFS filesystem on the SAN.

I waste no time opening a priority one case with VMware. After about an hour of various diagnostics, we notice that the problem is that the partition table on that LUN has disappeared. I manually recreate it with fdisk from one of the ESX hosts’ service console, run vmkfstools -V, and the data’s there again. We happily fire up all the VMs, and life continues, but my confidence in the product is shaken, to say the least.

Fast forward to Wednesday. I am virtualizing yet another server, the Windows guys are P2V’ing one too, and I realize that one of the five blade test hosts isn’t configured right, so Kyle kickstarts that sucker to get it dialed in. Guess what happens? Yep, boom - our SAN drops out again.

I was able to fix it in fairly short order (under two minutes) now that I knew what to do, but now I’m worried. We start stepping through our UNIX and Windows virtualization processes, and those seem fine. But… it strikes me that in both instances one of the other blades was kickstarting to Red Hat for that datacenter test. I asked Kyle to investigate our kickstart config, and he found a line in there (originally written by the old admin) that essentially says, “Clear the partition tables of all disks I am attached to.” Since that blade was designed to be an ESX host and had visibility to the same LUN, it would nuke the partition table on the SAN when kickstarted. Having that line in a kickstart config makes sense if it’s a standalone server you’re installing, but when a SAN enters the mix, well, it’s not so good.
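
The exact kickstart line isn't quoted here, so what follows is only a rough sketch of how one might check a config for that class of mistake. It assumes the culprit looked something like a `clearpart --all` or `zerombr` directive with no `--drives` restriction; the sample file and the function name `find_risky_directives` are made up for illustration.

```python
# Hypothetical kickstart snippet; NOT the real config from this story.
SAMPLE = """\
install
lang en_US
zerombr yes
clearpart --all --initlabel
part / --fstype ext3 --size 4096
%packages
@base
"""


def find_risky_directives(kickstart_text):
    """Return (line_number, line, reason) tuples for disk-wiping directives
    in the command section that are not limited to named drives."""
    findings = []
    for number, raw in enumerate(kickstart_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("%"):
            # %packages, %pre, %post: the command section is over.
            break
        tokens = line.split()
        command, options = tokens[0], tokens[1:]
        if command == "clearpart":
            wipes = any(opt in ("--all", "--linux") for opt in options)
            scoped = any(opt.startswith("--drives") for opt in options)
            if wipes and not scoped:
                findings.append(
                    (number, line, "clearpart is not limited with --drives")
                )
        elif command == "zerombr":
            findings.append(
                (number, line, "zerombr may initialize disks the host can see")
            )
    return findings


if __name__ == "__main__":
    for number, line, reason in find_risky_directives(SAMPLE):
        print(f"line {number}: {line!r}: {reason}")
```

A check like this is no substitute for LUN masking, since it only sees what the config says and not what the host can actually reach, but it would flag a blanket wipe before a SAN-attached box gets kickstarted.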

Needless to say, we have fixed that little issue by implementing LUN masking and by fixing the kickstart script. I’m just happy that the issue wasn’t a fatal bug in our hardware or in ESX Server.

Bottom line is that sometimes problems you face are caused by an external force rather than an internal issue. In this case, it was our kickstart script rather than a flaw in the hardware or software. We all may feel a little sheepish about this, but it could have been a LOT worse. Now that we have the RAM issue and this other ESX issue taken care of, I think (and hope!) we’re out of the woods and we can continue to deploy this as planned.

More Ubuntu Update-manager Shinyness

I decided to bite the bullet today and upgrade my fujip from breezy to dapper, using the graphical update manager.

The actual upgrade went without a hitch - it asked me a couple of times to replace conf files I’d modified manually, but everything worked fine during the upgrade.

Post-upgrade, I had to do the following:

  • Manually install network-manager as my custom pkgs were auto-removed - this was fine as I got the latest “real” packages
  • Fix widescreen
    • I was using 855resolution for the built-in Intel video, which has been merged/upgraded to 915resolution. The fix was as easy as apt-get install 915resolution.
  • Fix the audio, wireless, and PCMCIA
    • This was my fault - the default kernel was still set to my “custom” 2.6.12 from breezy, so none of that worked. I fixed /boot/grub/menu.lst and all that stuff “just worked”. :)
  • Rebuild the kernel
    • This is because of my EVDO card - the usbserial module needs patching. Rather than just roll the module, I build the whole kernel - mostly cause I’m too lazy to figure out how to make a .deb of just the module. One day I’ll spend the time to figure that out.
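
For the stale-default-kernel problem above, here's a minimal sketch of how one might check which entry a GRUB legacy menu.lst will actually boot. The sample file, the kernel version strings, and the function name `default_boot_title` are all assumptions for illustration, not the real contents of this laptop's /boot/grub/menu.lst.

```python
import re

# Hypothetical menu.lst; the kernel names are illustrative only.
SAMPLE = """\
default 2
timeout 3

title Ubuntu, kernel 2.6.15-23-386
root (hd0,0)
kernel /boot/vmlinuz-2.6.15-23-386 root=/dev/hda1 ro quiet splash

title Ubuntu, kernel 2.6.15-23-386 (recovery mode)
root (hd0,0)
kernel /boot/vmlinuz-2.6.15-23-386 root=/dev/hda1 ro single

title Ubuntu, kernel 2.6.12-custom
root (hd0,0)
kernel /boot/vmlinuz-2.6.12-custom root=/dev/hda1 ro
"""

# Matches "default 0", "default=0" and "title Some text".
_DIRECTIVE = re.compile(r"(default|title)(?:\s*=\s*|\s+)(.*)$", re.IGNORECASE)


def default_boot_title(menu_lst_text):
    """Return (index, title) of the default entry, or None if it cannot
    be worked out (for example "default saved" or an out-of-range index)."""
    default_index = 0  # entry 0 is used when no default line is present
    titles = []
    for raw in menu_lst_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = _DIRECTIVE.match(line)
        if not match:
            continue
        keyword, value = match.group(1).lower(), match.group(2).strip()
        if keyword == "default":
            if not value.isdigit():
                return None
            default_index = int(value)
        else:
            titles.append(value)
    if default_index < len(titles):
        return default_index, titles[default_index]
    return None


if __name__ == "__main__":
    print(default_boot_title(SAMPLE))
```

Entries are counted from zero in file order, which is how a numeric default can keep pointing at an old kernel after an upgrade shuffles the list.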

So as far as I can tell, the system is running well. I’ll have to poke at it some more, but I think the upgrade went relatively smoothly, compared to my upgrade headaches in the past.

More Picasa Info

Just spent a half hour messing with Picasa, and it’s mostly frustrating. Things I’ve noticed:

  • The slideshow functionality does not work at all. It completely scrambles the screen, and doesn’t accept ESC or any other keystroke to get out of it. I had to drop to console and kill the Picasa process to recover.
  • The import has hung twice on me, requiring me to kill -9 the process - it wouldn’t respond to a normal kill.
  • It consumes GOBS of memory. When importing a folder with about 300 pictures in it, it gobbled 1.2GB of virtual memory. Considering this box only has 512MB RAM, I think that’s … bad.
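
The "normal kill first, kill -9 if it ignores you" routine from the second bullet can be sketched in a few lines. This is a generic illustration and not anything Picasa-specific: the function name `stop_process` and the five-second grace period are assumptions, and it only works on a process you started yourself through `subprocess`.

```python
import subprocess
import sys


def stop_process(proc, grace_seconds=5.0):
    """Ask a child process to exit, then force it if it does not comply.

    On POSIX, terminate() sends SIGTERM (what a plain `kill` sends) and
    kill() sends SIGKILL (what `kill -9` sends).
    """
    if proc.poll() is not None:
        return "already exited"
    proc.terminate()
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
        return "killed"


if __name__ == "__main__":
    # Stand-in for a hung application: a child that just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    print(stop_process(child))
```

SIGKILL can't be caught or ignored, which is why it works on a hung process, but it also gives the program no chance to clean up after itself.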

So, I think I’m going to call this a failed experiment, from where I sit. I’d file bugs, but I think it’s a lost cause - I’ll just stick with f-spot. Or if push comes to shove, I’ll use iPhoto on my Mac.

Google Releases Picasa, F-spot Users Cringe.

Well, Google looks like it may actually be working on some Linux applications. Google Labs has just released a Linux version of Google’s photo-organizer application, Picasa. While this sounds good on the surface, after digging a little bit it seems that this may be a mostly-standard Windows version that’s Linux-enabled using some form of WINE technology.

While this certainly doesn’t suck by any stretch of the word, I don’t think it’s a real competitor to F-Spot. F-Spot’s a native Linux app (although it is written in Mono), and F-Spot seems faster. I haven’t spent much time mucking with Picasa, but it does seem to be functional. Its Windows-y feel puts me off, however; it doesn’t “look” like it belongs on my Ubuntu Desktop. I do appreciate Google making it available, though – hopefully this is a harbinger of other applications (Google Earth, pls k thx bye!) Here’s a screenshot or two - in the end, you be the judge. I’ll try and spend a little more time with this package and give a fairer review of it in the next day or so.

Funny Picture of the Day…

Was surfing old archives of my stuff and found this. I took this while commuting to work, back around the height of the dot-bomb – I’d say 2002 or so (the timestamp on the file is Jun 17 2002).

Gotta say, that’s a creative way to get your resume out there.