Citrix XenServer is great. No really. As long as you don’t want to do uncommon things like, say, replacing a network card which is your management interface, or deleting snapshots and expecting to get the freed space back instantly, XenServer is solid and very easy to set up and use. With a few clicks you can set up a VM with just about any available OS, attach it to a network interface or even a VLAN [more on that later], and you are only a few more mouse clicks away from starting it. I’ve run various OSes on it: a bunch of Linux flavours, FreeBSD, Solaris, Windows. It runs and runs and runs.
So where’s the but? Here it comes: …but if something unexpected happens, you are seriously screwed. Here are a few examples from the past couple of months.
Changing a NIC, which is also the management interface, of a pool server: this was about the worst nightmare I’ve ever had. What you’d expect to do is: shut down the machine, open it, replace the NIC, close it, switch it on again, wait for it to boot and start the VMs, done. What really happened is: I had to actually wipe and re-install the whole box, because there was apparently no documented, reverse-engineerable, or otherwise known way to simply change the MAC address somewhere, because that is managed by the pool master. Now, as the NIC was broken, the master wasn’t able to communicate with the pool server any more (not even on the second NIC, because that was not the management interface). Attempts to change it failed. Not even the “xe” tool was functional any more, so I couldn’t really gather the UUIDs in order to search through configurations etc. The master refused to talk to the pool server, and the pool server with the broken (and afterwards replaced) NIC refused to let me change anything, because that should be done on the master. Catch-22.
I consulted the official support forum, but nobody knew an answer there either. I’m sure there is a way to change it easily. After all it’s a Linux box with a modified Xen, not an inaccessible black box. Hang on… actually it felt a bit like that. I would like to think that Citrix certainly knows an easy solution, but as I’m not paying thousands of pounds for a product which is almost entirely based on free software, they of course kept quiet. (The bloody toolstack, which complicated things, is their own development, by the way.)
The end of that experience was that I had to remove the server from the pool (XenServer would then wipe the box, so you can’t re-join the pool later, either… awesome). After a clean setup and restoring all the VMs from previously created snapshots, the machine was finally able to join the pool. That was 6 hours after the NIC broke. Fortunately all VMs have an identical twin running on another machine, so it didn’t cause downtime (except a few minor hiccups while I was fiddling about with network settings). Otherwise all websites/applications would have been offline for 6 hours.
Without the XenServer toolstack, I could have resolved the issue within 10 minutes, which includes all of the steps mentioned earlier (what I would have expected).
I learned my lesson from it. As live-migration of VMs isn’t really necessary in most cases (my customers’ applications don’t benefit from it), it’s actually better to not form pools of your servers. Disconnected standalone servers are a lot easier to maintain and you don’t risk side-effects with pool members, because there aren’t any. The only real downside is that VLANs need to be configured individually on each server. Same applies to shared resources (NAS etc). But that’s fine.
Another almost unbelievable example is deleting snapshots. I create them all the time, because if something goes wrong, or someone breaks a VM setup, you want to be able to roll back to a previous version. Snapshots are one of the biggest advantages of virtualisation. A whole VM can be brought back to an older state within seconds. Or you can export it and reimport it elsewhere, clone another instance from it, work there, swing later. Anyway, if you use that feature often, it fills your disk (even the huge disks you get nowadays). So you regularly delete them and get your space back. Right? Nope, wrong. With XenServer you may or may not get your space back. When your monitoring tells you that you are running out of disk space, although you haven’t done anything but rotate snapshots for a while, you scratch your head in disbelief. Well, at least I did. Unfortunately, the official documentation confirms my observations. When I first read that reclaiming space causes downtime, I wasn’t sure whether laughing or crying was the best course of action.
In a production environment, you can’t just go ahead and suspend VMs just to get space back. Even if you only reduce performance (without causing downtimes, as we’re running twins of everything), you need to make affected customers aware of it. And how do you explain that? “Sorry, Sir, I need to suspend your service, because I need to delete old snapshots.” They’ll think you’re taking the piss.
Again, this “feature” is brought to you by Citrix’s toolstack, not Xen. If I decide to delete an LVM-based snapshot of a running VM on Xen, I can do that any time. No need to suspend anything or to manually reclaim free space afterwards.
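To illustrate what such a rotation could look like on vanilla Xen with LVM, here is a minimal Python sketch. It only decides which snapshots are old enough to go and builds the matching `lvremove` command lines; it does not run them. The volume group name `vg0`, the snapshot names and the "keep the N newest" policy are assumptions made up for this example, not a description of my actual setup.

```python
from datetime import datetime
from typing import List, Tuple


def snapshots_to_prune(snapshots: List[Tuple[str, datetime]], keep: int) -> List[str]:
    """Return the names of all snapshots except the `keep` newest ones."""
    if keep < 0:
        raise ValueError("keep must not be negative")
    # Newest first, so everything after the first `keep` entries is surplus.
    ordered = sorted(snapshots, key=lambda item: item[1], reverse=True)
    return [name for name, _created in ordered[keep:]]


def lvremove_command(volume_group: str, snapshot_name: str) -> List[str]:
    """Build the argv for removing one snapshot volume without a prompt."""
    return ["lvremove", "-f", f"{volume_group}/{snapshot_name}"]


if __name__ == "__main__":
    # Hypothetical snapshot inventory: (name, creation time).
    inventory = [
        ("web01-snap-a", datetime(2010, 8, 1)),
        ("web01-snap-b", datetime(2010, 8, 8)),
        ("web01-snap-c", datetime(2010, 8, 15)),
    ]
    for name in snapshots_to_prune(inventory, keep=2):
        # Print instead of executing, so the sketch is safe to run anywhere.
        print(" ".join(lvremove_command("vg0", name)))
```

To actually delete anything, the generated lists would have to be passed to something like `subprocess.run` as root; printing them first makes it easy to check the selection before it becomes destructive.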
My favourite subject is VLANs. I don’t know how many hours I’ve wasted trying to find what I did wrong, just to figure out in the end that it was not my fault… Citrix apparently manipulated the bridge code and never really tested it. You have to actually install ebtables (iptables for bridges, if you will) to work around that issue. I observed exactly the same thing as the poster there, and many others did, too. Their forums are full of problems related to VLANs and NIC bonding. Problems get worse with two NICs. VLANs may work out-of-the-box on both, only one, or none of the NICs. Apparently it depends on the NIC used (well, I’m assuming here that nobody uses old NICs without VLAN support any more nowadays), on which of the NICs is the management interface, and on a couple of other factors like weather, mood etc.
Once you know about the workaround mentioned earlier, you can solve it. But now, when you update your XenServer version, you can’t rely on Citrix. They might just remove the required kernel modules so that ebtables wouldn’t work any more. Sounds unlikely? Well, the reality is that ebtables did work until XenServer version 5.5, but in 5.6 the kernel support was removed (see last post here). To fix it, you end up downloading the XenServer SDK (which includes all the open source bits they are using) and recompiling the kernel yourself.
I won’t go deeper into this subject, but there are several issues with bonded NICs as well. And the management interface can never be on a tagged VLAN. All those are restrictions/problems solely related to Citrix’s stuff. Linux itself lets you create any combination of bonds and VLANs on as many interfaces as you want to. Unfortunately, you need to unlearn all about Linux network configuration, because if you try applying your knowledge, XenServer will overwrite your configuration as soon as you reboot (best case) or use its API or Windows client to manage NICs/VLANs.
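For comparison, this is roughly what a tagged VLAN on a bridge looks like with plain Linux tools. The Python sketch below only assembles iproute2 (`ip link`) command lines; it does not execute them. The interface names `eth1` and `br100` and the VLAN ID 100 are placeholders invented for the example, and the parent could just as well be a bond such as `bond0`.

```python
from typing import List


def vlan_bridge_commands(parent: str, vlan_id: int, bridge: str) -> List[List[str]]:
    """Build iproute2 commands for a VLAN sub-interface attached to a bridge."""
    if not 1 <= vlan_id <= 4094:
        raise ValueError("VLAN ID must be between 1 and 4094")
    vlan_if = f"{parent}.{vlan_id}"
    return [
        # Create the tagged sub-interface on top of the parent NIC (or bond).
        ["ip", "link", "add", "link", parent, "name", vlan_if,
         "type", "vlan", "id", str(vlan_id)],
        # Create the bridge the guests will be attached to.
        ["ip", "link", "add", "name", bridge, "type", "bridge"],
        # Enslave the VLAN interface to the bridge.
        ["ip", "link", "set", "dev", vlan_if, "master", bridge],
        # Bring both up (the parent interface must be up as well).
        ["ip", "link", "set", "dev", vlan_if, "up"],
        ["ip", "link", "set", "dev", bridge, "up"],
    ]


if __name__ == "__main__":
    for argv in vlan_bridge_commands("eth1", 100, "br100"):
        print(" ".join(argv))
```

These commands are not persistent across reboots; on a real box you would put the equivalent into the distribution’s own network configuration files.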
I could go on and on and on. There are many other quirks like being unable to shut down a VM when for some reason it can’t attach to a VNC console (but keeps trying, although you absolutely don’t need a console to shut it down); having a “force” option for many commands, which is useless, because it doesn’t force anything; being unable to remove stale shared storage; having to work around limitations which would for example prevent you from building a pool with an i7 920 and an i7 930 server; and quite a few more, which are of minor relevance in a production environment.
Don’t get me wrong. If you dig deep enough, you will find problems in any similarly complex software. And Citrix’s XenServer is not a bad product at all. Much of the functionality like live-migration isn’t available in VMware’s free version ESXi, and said free version doesn’t run on top of CentOS but on a custom Linux, which officially you can’t access via SSH (there are ways though, but you can’t expect any support at all). Also, XenServer’s GUI is self-explanatory and easy to use, which is certainly one of the main reasons for using XenServer, because whoever is going to use it after you set it up for them won’t have many problems getting started.
However, if you don’t have less knowledgeable people using it later, and if you don’t mind going the extra mile, you probably get the most flexibility and reliability if you set up Xen instead (the vanilla or “real” one, not XenServer). XenServer doesn’t really provide any additional functionality that isn’t available in Xen. (Some people even say the opposite is true, and you only get full Xen functionality if you purchase XenServer’s extra licenses; I wouldn’t go that far.) It does add convenience with its GUI and toolstack though, which you’d otherwise have to implement yourself: snapshots, shared storage use, starting up any type of guest OS etc. Most of those things aren’t exactly rocket science; only a few are a bit more tricky. But you can script/automate them as you please and you don’t need to expect any bad surprises caused by 3rd parties.
For example, I disabled Xen’s bridging code (by commenting out a single line in their scripts) and do the whole network configuration with standard OS tools, keeping it independent and consistent for future updates. (More details here.) Snapshots are easy enough to do with LVM, too. Live-migration I haven’t tested yet, but it doesn’t look too difficult to do either. (We don’t really need that feature here anyway.)
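As a rough idea of how little is needed for LVM snapshots, here is a small Python sketch that builds the `lvcreate` call for a timestamped snapshot of a guest’s logical volume. It only constructs the command and does not run it. The volume group `vg0`, the volume name `web01-disk`, the 5G copy-on-write size and the naming scheme are all assumptions chosen for illustration.

```python
from datetime import datetime
from typing import List


def snapshot_name(volume: str, when: datetime) -> str:
    """Derive a sortable snapshot name from the volume name and a timestamp."""
    return f"{volume}-snap-{when:%Y%m%d-%H%M%S}"


def lvcreate_snapshot_command(volume_group: str, volume: str,
                              size: str, when: datetime) -> List[str]:
    """Build the argv for creating an LVM snapshot of an existing volume.

    `size` is the space reserved for changed blocks (e.g. "5G"), not the
    size of the origin volume.
    """
    return [
        "lvcreate", "--snapshot",
        "--name", snapshot_name(volume, when),
        "--size", size,
        f"/dev/{volume_group}/{volume}",
    ]


if __name__ == "__main__":
    argv = lvcreate_snapshot_command("vg0", "web01-disk", "5G", datetime.now())
    # Print instead of executing; run it via subprocess as root when ready.
    print(" ".join(argv))
```

Because the names sort chronologically, a later clean-up job can work out the oldest snapshots from the names alone.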
What I struggled with was getting other OSes running, namely FreeBSD. But now that I have sorted that out, I can easily clone and fork more FreeBSD VMs on the vanilla Xen machines. Hence, Citrix XenServer isn’t providing any benefits there either.
As you can see (and as the title suggests), I’m considerably fed up with XenServer’s quirks; some of them are too big to accept in production environments. Consequently, we’re going to “migrate” back to Xen where we can. (Admittedly, in some environments we won’t be able to do that for another year or so.)
Once you’ve worked out how XenServer stores VM backups (yep, they did their own thing there too, and the format is really stupid), it’s not too difficult to convert them. I’ve done that for both CentOS and FreeBSD XenServer images. They run smoothly on vanilla Xen after converting them back.
Once again the “keep it simple” motto wins. Additional toolstacks and bloat cause more problems than necessary, and the manufacturer turns out to be the only one benefitting from it — as often is the case. So long, XenServer — Hello Xen!
(Update: Only three hours after I published this, one of our XenServers started refusing to create new VMs from templates…)
(Update 2: It’s cursed. Yesterday I was all of a sudden unable to attach any block devices, hence I was unable to start new VMs, reboot existing ones, or increase storage. I’m not the only one who faces that problem and does not get any help from the experts at Citrix.)
(Update 3, Aug 25th: Done. Last weekend we converted the last remaining XenServers to vanilla Xen. Thanks to the twin design, this went through without any downtime whatsoever; it was a major piece of work, but certainly worth it. Chapter closed.)