GlusterFS - A workhorse that needs to be tamed
I'm sure by now most of you will have heard of GlusterFS, which allows you to store data on a very large scale, replicated, striped, or both, across multiple physical boxes. On the face of it, and if you believe the marketing, it is THE most reliable and fastest solution. And indeed, it has massive potential, and it has matured a lot in the years since I last wrote about it. However, it still has a few nasty pitfalls which you need to be aware of before deploying it in a production environment. You should test thoroughly how it copes with your workload, and how your applications and infrastructure behave in case of failure.
What is GlusterFS, and what is it not?
You can think of GlusterFS as a RAID device which works across the boundaries of a single physical disk array. Take RAID-1, for example, which mirrors data between two identical disks. In GlusterFS jargon, you run two bricks in replicate mode, where a brick is storage in the most general sense: an array of disks (which could itself use RAID), a single disk, a partition, or a directory. Anything that can be mounted into your filesystem hierarchy qualifies as a brick. The key feature of GlusterFS is that it treats bricks on different physical machines as one volume, which can be accessed by any number of clients. It can be mounted either via the Fuse-based GlusterFS client, or via NFS or CIFS/Samba. You can use RAID-0 style striping for read speed, RAID-1 style mirroring for real-time replication, RAID-10 for both, or you can go beyond any of those and spread the stripes or mirrors across any number of bricks. 4-node replication? No problem at all. GlusterFS gives you enormous flexibility and performance when it comes to making large amounts of data available across multiple nodes. Since version 3.2 (if I'm not mistaken), they have even added GeoReplication, which allows a master/slave setup, where the slave can be a local or a remote site. Be it for backups or for a standby version of your application in a different geographical location… it's possible. Because GeoReplication requires neither locking nor synchronous replication, the network speed to your remote site isn't that important either; it copes well with slow links.
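As a rough sketch of what setting up GeoReplication looks like on the CLI (the volume name "myvol" and the slave host/path are made up, and the exact syntax may differ between minor releases):

```shell
# Start asynchronous geo-replication from the local master volume "myvol"
# to a directory on a remote slave box (hypothetical host and path):
gluster volume geo-replication myvol backup.example.com:/data/gluster-slave start

# Check on the session afterwards:
gluster volume geo-replication myvol backup.example.com:/data/gluster-slave status
```

Since the replication is asynchronous, the slave simply lags behind on a slow link rather than dragging the master's write performance down.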
This sounds very different from, say, a DRBD/GFS2 or DRBD/OCFS2 setup, doesn't it? And indeed it is! GlusterFS, unlike DRBD, does not provide a block device. What that means is that it compares hashes of files, and if files on the nodes differ (for example after a failure), it copies entire files across, not only the changed blocks. In normal day-to-day operation that's not a big problem, in particular as you get a lot of flexibility, which is unmatched by other solutions. Where it does make a difference is during recovery. More on that in the Caveats section.
A variety of different connectors
I mentioned earlier that there are several ways to connect to your GlusterFS volumes. First, there's Gluster's own client, which uses the kernel's Fuse layer. This client is Gluster's recommendation if your workload requires a high volume of fast write operations. If your workload is more about reading small files quickly, they recommend NFS (the NFS server is part of the glusterd daemon, which serves the volumes to the clients). Samba/CIFS mainly targets Windows clients.
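For illustration, here is how the two Linux-side connectors are mounted (server name "gluster1" and volume name "myvol" are made up):

```shell
# Native Fuse-based client:
mount -t glusterfs gluster1:/myvol /mnt/gluster

# Via Gluster's built-in NFS server; it speaks NFSv3 over TCP only,
# so the client must be told not to try NFSv4 or UDP:
mount -t nfs -o vers=3,mountproto=tcp gluster1:/myvol /mnt/gluster-nfs
```

With the native client, the mount server is only used to fetch the volume layout; after that the client talks to all bricks directly, so it keeps working if that one server goes down. The NFS mount, by contrast, is tied to the server you mounted from.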
All these connectors have their advantages and disadvantages, and you will want to test them thoroughly for your particular workload. Also, in SELinux environments you will need to tweak your policies if you use the GlusterFS client, whereas NFS is a lot more straightforward (don't forget that Apache needs to be allowed to access NFS directly if that's your intention; setsebool -P httpd_use_nfs=on is your friend). I know most people find it easier to switch off SELinux altogether, but for me personally that is never an option. I'd rather spend hours tweaking the SELinux policies, if necessary. Anyhow, the bottom line is that both NFS and CIFS make GlusterFS very attractive for platforms beyond Linux. FreeBSD, for example, although I'm not sure whether the native client has reached a production-ready state there yet; I shall give that a spin soon, and in the meantime NFS will do.
As a rule of thumb, high availability, robustness, scalability and the like always come with a downside: write performance. During write operations, all nodes need to be kept in sync, which means that the weakest link (or the slowest disk, for that matter), together with some locking and network/protocol overhead, determines the actual write speed. That is normal. (Note: pure throughput must not be confused with the time it takes until a file becomes accessible on a node other than the one it was written to.)
For that reason you can never expect a highly available file system to solve all your problems. There's no such thing as "one size fits all". Your application needs to be cluster/HA-aware. In practice that means you will have to select carefully which type of information you store where. This is of course true for GlusterFS, too. However, when it comes to read performance, GlusterFS is actually very fast. Not as fast as a local block device, obviously, but personally I wasn't able to tell the difference between native NFS and Gluster's NFS implementation. The GlusterFS client (fuse/glusterfs, not NFS), however, seems to be a little slower at reading data while being faster at writing. It really depends on your workload. The bottom line is: GlusterFS is fast and flexible, which alone is a big plus over many other solutions. For maximum read performance you can of course use stripes (data scattered across multiple nodes), which the GlusterFS client reads from simultaneously. It's fairly obvious that big files in particular benefit from such a setup.
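Creating such a striped volume is a one-liner on the CLI (server and volume names below are made up for the sake of the example):

```shell
# Scatter data in stripes across four bricks on four separate servers,
# so that large files can be read from all of them in parallel:
gluster volume create stripevol stripe 4 \
  server1:/export/brick1 server2:/export/brick1 \
  server3:/export/brick1 server4:/export/brick1

gluster volume start stripevol
```

Remember that striping alone gives you no redundancy; if any one of the four bricks fails, the files striped across it are gone, just as with RAID-0.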
If you intend to deploy GlusterFS, you had better plan a serious amount of time for the first tests and for integration into your setup, including benchmarks and failover. GlusterFS is powerful and not too difficult to get started with, but you'll soon run into various rather specific questions which aren't documented well (or at all). Quite frankly, the online documentation is poor, or at best rudimentary. Obviously Gluster, as a business, wants to sell their expertise, and there's nothing wrong with that. So be prepared to browse the mailing list archives or hang out in #gluster on irc.freenode.net.
Caveats
GlusterFS has matured a lot over the last few years, and you certainly don't need to worry about losing data (after all, it's file-based, and you can copy anything straight out of the bricks' directories if you wish). However, some major issues and pitfalls still exist.
- If you reintroduce or replace a node which was faulty or offline for a while, the self-healing will transfer entire files back from the up-to-date nodes onto the reintroduced one. This consumes a lot of network bandwidth and, even worse, CPU time (possibly due to the hash comparison). If a GlusterFS brick lives on a box together with other services, you will experience a significant performance hit.
- Large files are locked while being replicated. In practice that means you really can't use GlusterFS as a backend for VMs at the moment, unless recovery always happens in a controlled manner at times when you can afford to shut down the running VMs for the entire duration of the healing. That somewhat defeats the purpose of a high-availability storage cluster.
- However, a GlusterFS engineer told me earlier today on irc.freenode.net that this issue will be tackled in GlusterFS 3.3, if not earlier. Only a matter of months, I suppose.
- You absolutely must synchronise the system time of all bricks. If you're not doing that already anyway, do it before deploying GlusterFS. (use NTP for your own sanity)
- Make sure that the bricks of one volume are of identical size, and that you don't accidentally fill up the disk space by other means. I had a situation the other day where I wanted to replace a brick; what I didn't realise at first was that someone had set a disk quota on the new brick. Consequently, it stopped writing long before all the data had been copied. However, GlusterFS did not warn me, nor did it report an error; it actually confirmed a successful migration, although only a third of the files had been transferred!
- Clearly the lack of accessible disk space wasn't GlusterFS's fault, and it's probably not a common scenario either, but it should at least spit out an error message. Imagine what would have happened if I had taken the other node offline after the allegedly successful migration! A total mess.
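Since GlusterFS won't catch this for you, it pays to sanity-check a replacement brick yourself, before and after the migration. A rough sketch (hosts and paths are made up):

```shell
# Before the replacement: does the new brick really offer the same
# usable space as the old one, and are any filesystem quotas lurking?
ssh newserver df -h /export/brick1
ssh newserver repquota -a

# After the migration: compare file counts on the old and new brick
# instead of trusting the success message alone.
ssh oldserver 'find /export/brick1 -type f | wc -l'
ssh newserver 'find /export/brick1 -type f | wc -l'
```

A file-count comparison is crude, but it would have caught my two-thirds-missing migration immediately.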
Presumably none of this would have happened if I had taken up their commercial offerings. :-) Those of you who prefer D.I.Y. had better be prepared to spend a serious amount of time fitting it into your use case and, more importantly… monitoring it closely!
GlusterFS has made a lot of positive progress over the last 2-3 years. It's very easy to get started with, especially on RHEL/CentOS, and it offers enormous flexibility and opportunities. The new CLI makes basic configuration much, much easier than it used to be. With a few simple commands you can create your volumes (on multiple servers, a.k.a. "peers", simultaneously). You could say that it's actually fun to use GlusterFS!
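To give you an idea of just how few commands it takes, here is a minimal sketch of a two-way mirror (peer and brick names are made up):

```shell
# Run on server1: add the second box to the trusted pool.
gluster peer probe server2

# Create a RAID-1 style volume mirrored across one brick per server:
gluster volume create myvol replica 2 \
  server1:/export/brick1 server2:/export/brick1

# Bring it online and confirm the layout:
gluster volume start myvol
gluster volume info myvol
```

From there, any client can mount server1:/myvol and the data is kept in sync on both boxes.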
However, if you (like me) are looking at GlusterFS as a backend for Xen or VMware VMs in order to facilitate live-migration and resilience, you will probably need to wait for version 3.3, unless controlled recovery with planned downtime is an option for you. Might be worth keeping an eye on their Git repository (I certainly will). While using it to serve files for all sorts of things already, I'm really looking forward to using it as a backend for Xen soon! :)
Version 3.3 brings some other promising new features, too… unified storage, object storage… I see memcached on the list of dependencies… it all looks interesting. Beta 1 is out, by the way.