GlusterFS, a workhorse that needs to be tamed

31/07/2011 – 08:51 pm

I’m sure by now most of you will have heard of GlusterFS, which allows you to store data on a very large scale, replicated, striped, or both, across multiple physical boxes. On the face of it, and if you believe the marketing, it is THE most reliable and fastest solution. And yes indeed, it has got massive potential, and it has matured a lot in the years since I last wrote about it. However, it still has a few nasty pitfalls, which you need to be aware of before deploying it in a production environment. You should really test thoroughly how it copes with your workload, and how your applications and infrastructure behave in case of failure.

What is GlusterFS, and what is it not?

You can think of GlusterFS as a RAID device that works across the boundaries of a single physical disk array. Take RAID-1 for example, which mirrors data between two identical disks. In GlusterFS’s jargon, you run two bricks in replicate mode, where a brick is storage in the most general sense: an array of disks (which could itself use RAID), a single disk, a partition, or simply a directory. Anything that can be mounted into your filesystem hierarchy qualifies as a brick. The key feature of GlusterFS is that it treats bricks on different physical machines as one volume, which can be accessed by any number of clients. It can be mounted either via the Fuse/GlusterFS client, or via NFS or CIFS/Samba. You can use RAID-0 style striping for read speed, RAID-1 style mirroring for real-time replication, RAID-10 for both, or you can go beyond any of those and spread the stripes or mirrors across any number of bricks. 4-node replication? No problem at all. GlusterFS gives you enormous flexibility and performance when it comes to making large amounts of data available across multiple nodes.
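
To give you an idea of how little is involved on the command line these days, here is a minimal sketch of a two-brick replicated volume. The host names server1 and server2, the brick path /export/brick1 and the volume name myvol are made up for illustration; adjust them to your own setup.

    # on server1: add server2 to the trusted storage pool
    gluster peer probe server2

    # create a 2-way replicated volume with one brick per server
    gluster volume create myvol replica 2 transport tcp \
        server1:/export/brick1 server2:/export/brick1

    # start it and check the result
    gluster volume start myvol
    gluster volume info myvol

    # on a client: mount it via the native Fuse client
    mount -t glusterfs server1:/myvol /mnt/myvol

The server named in the mount command is only used to fetch the volume description; after that the client talks to all bricks directly.
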
Since version 3.2 (if I’m not mistaken), they have even added GeoReplication, which allows a master/slave setup, where the slave can be a local or a remote site. Be it for backups or to keep a standby version of your application in a different geographical location… it’s possible. Because GeoReplication does not require locking or synchronous replication, the network speed to your remote site isn’t that important either; it copes well with slow links.
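
Setting that up is equally terse. A rough sketch, assuming a master volume called myvol and a slave directory on remotehost (both placeholders), with passwordless SSH already configured between the sites:

    # start replicating the master volume to a directory on the remote box
    gluster volume geo-replication myvol remotehost:/data/backup start

    # and keep an eye on it
    gluster volume geo-replication myvol remotehost:/data/backup status

As far as I can tell it pushes the changes across with rsync over SSH, which is exactly why it is so relaxed about slow links.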

This sounds very different from, for example, a DRBD/GFS2 or DRBD/OCFS2 setup, doesn’t it? And indeed it is! GlusterFS, unlike DRBD, does not provide a block device. It works on whole files: it compares hashes of files, and if files on the nodes differ (for example after a failure), it will copy entire files across, not only the changed blocks. In normal day-to-day operation that’s not a big problem, particularly as you get a lot of flexibility in return, which is unmatched by other solutions. Where it does make a difference is during recovery. More on that in the Caveats section.

A variety of different connectors

I mentioned earlier that you can use a couple of different ways to connect to your GlusterFS volumes. First, there’s their own GlusterFS client, which uses the kernel’s Fuse layer. This client is Gluster’s recommendation if your workload involves a high volume of fast write operations. If your workload is mostly about reading small files quickly, they recommend NFS. (The NFS server is part of the glusterd daemon, which serves the volumes to the clients.) Samba/CIFS is probably mainly targeting Windows clients.
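
For completeness, this is roughly what the two Unix-side mounts look like on a client; the host and volume names are placeholders again, and the NFS options reflect the fact that Gluster’s built-in NFS server only speaks NFSv3 over TCP, so double-check them against your version:

    # native Fuse client (glusterfs-fuse package required on the client)
    mount -t glusterfs server1:/myvol /mnt/gluster

    # Gluster's built-in NFS server (NFSv3, TCP only)
    mount -t nfs -o vers=3,mountproto=tcp server1:/myvol /mnt/nfs

    # CIFS: export a Fuse mount of the volume through Samba, e.g. in smb.conf:
    #   [gluster]
    #   path = /mnt/gluster
    #   read only = no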

All these connectors have their advantages and disadvantages. You want to test them thoroughly for your particular workload. Also, in SELinux environments, you will need to tweak your policies if you use the GlusterFS client, whereas NFS is a lot more straightforward (don’t forget that Apache needs to be allowed to access NFS directly if that’s your intention; setsebool -P httpd_use_nfs=on is your friend). I know most people find it easier to switch off SELinux altogether, but for me personally that is never an option; I’d rather spend hours tweaking the SELinux policies if necessary. Anyhow, the bottom line is that both NFS and CIFS make GlusterFS very attractive for platforms beyond Linux. FreeBSD, for example, although I’m not sure whether the native client has reached a production-ready state there yet; I shall give that a spin soon, and in the meantime NFS will do.
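
For the Fuse client under SELinux, this is more or less the routine I follow on CentOS; the module name glusterlocal is just something I made up, and you should of course review what audit2allow generates before loading it:

    # see what is actually being denied
    ausearch -m AVC -ts recent

    # turn the gluster-related denials into a local policy module and load it
    grep gluster /var/log/audit/audit.log | audit2allow -M glusterlocal
    semodule -i glusterlocal.pp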

Performance

As a rule of thumb you can say that high availability, robustness, scalability etc. always come with a downside: write performance. During write operations, all nodes need to be kept in sync, which means that the weakest “link” (or slowest disk, for that matter), together with some locking and network/protocol overhead, determines the actual write speed. That is normal. (Note: pure throughput must not be confused with the time it takes to actually be able to access a file on a different node than the one it was written to.)

For that reason you can never expect a high-availability file system to solve all your problems. There’s no such thing as “one size fits all”. Your application needs to be cluster/HA aware. In practice that means you will have to select carefully which type of information you store where. This is of course true for GlusterFS, too. However, when it comes to read performance, GlusterFS is actually very fast. Not as fast as a local block device, obviously, but personally I wasn’t able to tell the difference between native NFS and Gluster’s NFS implementation. The GlusterFS client (fuse/glusterfs, not NFS), however, seems to be a little bit slower reading data, while being faster writing. It really depends on your workload. The bottom line is: GlusterFS is fast and flexible, which alone is a big plus over many other solutions. For maximum read performance you can of course use stripes (data scattered across multiple nodes), which the glusterfs client connects to simultaneously. It’s kind of obvious that big files in particular benefit from such a setup.
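
If you want to compare the connectors on your own hardware, something as crude as dd already tells you a lot about streaming throughput; small-file workloads behave very differently, so don’t read too much into the numbers. A sketch, assuming the two mounts from the earlier example:

    # write test, forcing the data out to the bricks before a speed is reported
    dd if=/dev/zero of=/mnt/gluster/ddtest bs=1M count=1024 conv=fdatasync
    dd if=/dev/zero of=/mnt/nfs/ddtest bs=1M count=1024 conv=fdatasync

    # read test; drop the page cache first so you measure the network, not RAM
    echo 3 > /proc/sys/vm/drop_caches
    dd if=/mnt/gluster/ddtest of=/dev/null bs=1M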

Caveats

If you intend to deploy GlusterFS, you had better plan a serious amount of time for the first tests and the integration into your setup, including benchmarks and failover. GlusterFS is powerful and not too difficult to get started with, but you’ll soon run into various rather specific questions which aren’t documented well (or not at all). Quite frankly, the online documentation is poor, or at best rudimentary. Obviously Gluster, a business, wants to sell their expertise, and there’s nothing wrong with that. So be prepared to browse mailing list archives or hang out in #gluster on irc.freenode.net.

GlusterFS has matured a lot over the last few years, and you certainly don’t need to worry about losing data (after all, it’s filesystem based and you can copy anything out of the bricks’ directories directly, if you wish). However, some major issues and pitfalls still exist.

  • If you reintroduce or replace a node which was either faulty or offline for a while, the self-healing will transfer entire files back from the up-to-date nodes onto the reintroduced one. This consumes a lot of network bandwidth and, even worse, CPU load (possibly due to the hash comparison). If a GlusterFS brick lives on a box together with other services, you will experience a significant performance hit.
  • Large files are locked while being replicated. In practice that means you really can’t use GlusterFS as a backend for VMs at the moment, unless recovery always happens in a controlled manner at times when you can afford to shut down the running VMs for the entire duration of the healing. That somewhat defeats the purpose of a high-availability storage cluster.
    However, a GlusterFS engineer told me earlier today on irc.freenode.net that this issue will be tackled in GlusterFS 3.3, if not earlier. Only a question of months, I suppose.
  • You absolutely must synchronise the system time of all bricks. If you’re not doing that already anyway, do it before deploying GlusterFS. (use NTP for your own sanity)
  • Make sure that the bricks of one volume are of identical size, and that you don’t fill up the disk space by other means by mistake. I had a situation the other day where I wanted to replace a brick; what I didn’t realise at first was that someone had set a disk quota on the new brick. Consequently it stopped writing long before all the data could be copied. However, GlusterFS did not warn me, nor did it report an error; it actually confirmed a successful migration, although only a third of the files had been transferred!
    Clearly the lack of accessible disk space wasn’t GlusterFS’s fault, and it’s probably not a common scenario either, but it should at least spit out an error message. Imagine what would have happened if I had taken the other node offline after the allegedly successful migration! Total mess. A sketch of the sanity checks I run around a brick replacement these days follows right after this list.
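
As promised above, here is roughly what I do around a brick replacement these days. The host and path names are placeholders once more, repquota only tells you something if filesystem quotas are enabled on the brick, and the NTP bit is simply the stock RHEL/CentOS way of dealing with the clock-sync caveat:

    # keep the clocks in sync (see the caveat above)
    yum install -y ntp && chkconfig ntpd on && service ntpd start

    # does the new brick really have the space it claims? any quotas lurking?
    df -h /export/brick1
    repquota -a

    # migrate, and keep polling the status rather than trusting a single "OK"
    gluster volume replace-brick myvol oldserver:/export/brick1 \
        newserver:/export/brick1 start
    gluster volume replace-brick myvol oldserver:/export/brick1 \
        newserver:/export/brick1 status

    # paranoia: compare file counts on the old and the new brick afterwards
    find /export/brick1 -type f | wc -l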

Presumably none of these things would have happened if I had taken up their commercial offering. :-)  Those of you who prefer D.I.Y. had better be prepared to spend a serious amount of time fitting it into your use case and, more importantly… monitoring it closely!

Summary

GlusterFS has made a lot of positive progress over the last 2-3 years. It’s very easy to get started with, especially on RHEL/CentOS, and it offers enormous flexibility and opportunities. The new CLI makes basic configuration much, much easier than it used to be. With a few simple commands you can create your volumes (on multiple servers, aka “peers”, simultaneously). You could say that it’s actually fun to use GlusterFS!

However, if you (like me) are looking at GlusterFS as a backend for Xen or VMware VMs in order to facilitate live migration and resilience, you will probably need to wait for version 3.3, unless controlled recovery with planned downtime is an option for you. It might be worth keeping an eye on their Git repository (I certainly will). While I’m already using it to serve files for all sorts of things, I’m really looking forward to using it as a backend for Xen soon! :)

Version 3.3 brings some other promising new features, too… unified storage, object storage… I see memcached on the list of dependencies… it looks interesting. Beta 1 is out, by the way.

9 Responses to “GlusterFS, a workhorse that needs to be tamed”

  1. Nice writeup. The point about replication performance is an especially good one. That’s why one of the development tasks for the next release of CloudFS (which is based on GlusterFS) is likely to be a new replication translator which is based on fundamentally different principles than AFR. Its features will include no additional lock/xattr operations in the non-failure case (only overhead will be the duplication of the write request itself) and fully automatic self-heal taking time proportional to the number of partially completed requests (instead of to the total number of files).

    There are many more features planned, in addition to those that Gluster themselves add during the same time. Please check out http://cloudfs.org/2011/07/status-report-2011-07 and http://cloudfs.org/2011/07/data-integrity, and let us know which features interest you the most. Thanks!

    By Jeff Darcy on Aug 2, 2011

  2. Hi Jeff, thanks for your comment. I should add, though, that in the few days since I wrote this, the GlusterFS team has become very active and published a patch for the latest beta, which implements so-called granular locking. That actually resolves the problem with frozen VMs during healing. I was one of the first people to get their hands on the source to test it. You certainly know about it already, as you hang out in their IRC channel all day, but for others this might be interesting :-)
    http://community.gluster.org/p/centos-6-rpms-for-3-3beta-with-granular-locking/
    I’ll check out your CloudFS nonetheless. Is it, and will it remain open source? (hopefully a rhetorical question)

    By admin on Aug 6, 2011

  3. Yes, HekaFS (the project previously known as CloudFS) will remain open source. Development is funded by Red Hat, with cooperation from Gluster, and both companies are thoroughly committed to open source. There are some issues with respect to who will own and support certain parts (e.g. the SSL/multi-threaded transport) but since the licenses are synchronized that doesn’t affect whether it will be open or not.

    By Jeff Darcy on Aug 10, 2011

  4. Thanks once more for your reply, Jeff!
    To answer your initial question as to which features I’d be interested in most: CLI-based management which actually creates meaningful yet compact output, ideally easy to parse so that it can be fed into monitoring systems like Zabbix or Nagios. I’m a monitoring fetishist, quite frankly. If I don’t know exactly what’s happening where in my environment, I get nervous, and GlusterFS’s CLI is just not ideal (you can of course script around it, but that’s not optimal, and GlusterFS just changes too rapidly to keep up with that).
    Another thing is documentation. Quite frankly Gluster’s documentation is minimalistic. Since the CLI is there, it got even worse, because you don’t usually need to access configuration files directly. However, here again, I do need to know what’s happening and why… :)
    And last but not least: Encryption is a brilliant thing to implement. Good that you agree there as it’s already on your feature list :)

    One thing, which you guys might not put your focus on for obvious reasons is FreeBSD support. I’d love to see working connectors other than NFS for FreeBSD.

    By admin on Aug 13, 2011

  5. Thanks for that. I want to use Gluster for my Blu-ray discs and am planning to install 8x 19″ servers with 12 TB each. What I didn’t find is a hint on the best way to use the maximum size of all disks. Should I run all the disks on each server in RAID and stripe that into Gluster? Then I would lose one disk on each server. Or is there a possibility to do that across all servers, like RAID-5 or RAID-6? What happens if one brick is lost, or can I simply add one more brick with more than 12 TB? Need answers, thx.

    By sunghost on Jan 24, 2012

  6. Not sure if I understand your intentions correctly. Read-only media in Gluster? That doesn’t sound right :)
    Best to check the gluster IRC channel on Freenode or their community websites to get specific advice, I suppose.

    By admin on Feb 15, 2012

  7. Thanks for the excellent write-up, Carsten. I’ve been looking into using GlusterFS to create a highly-available, parallel storage cluster for storing virtual disk images but had not realized that GlusterFS’s healing function would prevent the use of the VMs during recovery of a failed/downed node.

    I look forward to your next instalment and hope that you’ll continue to address this particular application of GlusterFS.

    By Eric Pretorious on May 13, 2012

  8. Hi Eric. Thanks for your comment. A patch was made available shortly after I wrote this article, and by now it should be in the GlusterFS codebase. On top of that, since Red Hat bought Gluster, I’m convinced they are interested in resolving this, because it would be very helpful when it comes to running KVM in scaling setups. (I’m sure you guys at VMware monitor your competition closely :-))

    By admin on May 13, 2012

  9. Suppose host1 and host2 each have two available disk partitions, sdb1 and sdc1, dedicated as Gluster bricks.
    In 3.0.5 you can, with some work, RAID1{host1/sdb1, host2/sdb1} and RAID1{host1/sdc1, host2/sdc1} and then RAID0 these together. Is 3.2+ Gluster smart enough to accept these four bricks on two hosts and, given a replication factor of 2, arrange things as before, with the result that one host going down would be perfectly OK, as it would be with 3.0.5?

    By jimd on Sep 12, 2012
