High Availability and Cloudy Problems

VoltDB, like many distributed systems, achieves high availability through redundant processing nodes. VoltDB calls this K-Safety. Essentially, the distributed system can answer any request on at least K+1 servers, so it can tolerate at least K hardware failures. The operator picks the value of K that offers the best tradeoff between robustness to failure and cost. Other systems use the term “replica set” to describe similar functionality.

Let’s talk about EC2-style clouds that provide you with a virtualized server at an hourly cost. Imagine you want to deploy a VoltDB instance of 3 nodes with K = 2, i.e. all data is replicated to all nodes so that you can tolerate 2 failures before the system becomes unavailable. Since server failures are not terribly frequent, that’s a pretty robust system. You’ve decided to use EC2’s “Large” instances. You provision them from the EC2 console, install VoltDB and start them running.
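To make the K = 2 example concrete, here is a minimal Python sketch of what “tolerate K failures” means. It is only an illustration, not VoltDB code: the function names `is_available` and `survives_any`, the node names and the four-partition layout are all made up for the example.

```python
from itertools import combinations

def is_available(replicas, failed):
    """True if every partition still has at least one copy on a live node.

    replicas: dict mapping partition -> set of nodes holding a copy
    failed:   set of nodes that are down
    """
    return all(nodes - failed for nodes in replicas.values())

def survives_any(replicas, nodes, failures):
    """True if the cluster stays available for every possible
    combination of `failures` simultaneous node failures."""
    return all(
        is_available(replicas, set(down))
        for down in combinations(nodes, failures)
    )

# Hypothetical 3-node cluster with K = 2: every partition lives on every node.
nodes = ["node1", "node2", "node3"]
replicas = {partition: set(nodes) for partition in range(4)}

print(survives_any(replicas, nodes, 2))  # True: any 2 nodes can fail
print(survives_any(replicas, nodes, 3))  # False: losing all 3 loses everything
```

Note that this model counts node failures as independent events. That assumption is exactly what the rest of this post calls into question.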

The Problem

Amazon, like most other public clouds, makes no assurances that your three instances aren’t running on the same honkin’ server. It’s possible that a single failure in that server could bring your whole cluster down. Amazon doesn’t offer a direct way to provision multiple instances that are guaranteed to be on separate physical hardware.
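A rough way to see the damage: if you somehow knew which physical server each VM landed on, the number of server failures you can really tolerate is set by the distinct servers behind each partition, not by K. The sketch below assumes you have that mapping, which is precisely what EC2 doesn’t give you. The function name `tolerated_host_failures`, the VM names and the host names are hypothetical.

```python
def tolerated_host_failures(replicas, host_of):
    """Worst-case number of physical host failures the cluster survives.

    replicas: dict mapping partition -> set of VMs holding a copy
    host_of:  dict mapping VM -> physical host it runs on (assumed known)
    """
    # A partition is lost once every host carrying one of its copies is down,
    # so it survives (distinct hosts - 1) host failures. The cluster is only
    # as strong as its weakest partition.
    return min(
        len({host_of[vm] for vm in vms}) for vms in replicas.values()
    ) - 1

# 3 VMs, K = 2: one partition shown, copied to all three VMs.
replicas = {"p0": {"vm1", "vm2", "vm3"}}

spread = {"vm1": "hostA", "vm2": "hostB", "vm3": "hostC"}
mixed = {"vm1": "hostA", "vm2": "hostA", "vm3": "hostB"}
stacked = {"vm1": "hostA", "vm2": "hostA", "vm3": "hostA"}

print(tolerated_host_failures(replicas, spread))   # 2
print(tolerated_host_failures(replicas, mixed))    # 1
print(tolerated_host_failures(replicas, stacked))  # 0
```

In the stacked case you are paying for K = 2 and getting the hardware fault tolerance of K = 0.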

Current Options

First, Amazon offers “Availability Zones”. If you provision servers across availability zones, then they will be on different hardware. This has the added benefit of protecting you from entire datacenter failures, but it has some huge downsides.
1. You’re charged for bandwidth between zones. In a system like VoltDB, that can get expensive.
2. Latency between zones is higher and more variable. In a system like VoltDB, that can affect performance.
3. You still can only be as redundant as the number of zones you’re using.

Second, you can use “Cluster Compute” instances on EC2. Amazon says they provision one instance per server, which is ideal. They also give you 10GigE and decent specs, which is great. The downside is they cost much more than the “Large” instances you wanted. The price per gigabyte of memory is also higher than for the high-memory instances, which are a better match for VoltDB.

If you’re using a different public cloud, then I’m not even sure what your options are. I don’t think Rackspace gives you any more information than Amazon, and it’s not clear to me if they offer a 1 VM to 1 Server option. [EDIT: Rackspace offers slightly more info than Amazon. See comment below.]

If you’re on a private cloud, you may have a lot more control over the provisioning process. I’m not sure VMware will let you automatically provision several VMs without putting two VMs on the same server, but I think you can usually do this manually. Just be careful with VMotion, which can move VMs back onto the same server.

VoltDB’s Position

For now, if you want to use a public cloud and you care about availability, use a cloud that offers a 1 VM to 1 Server guarantee.

We’re working on a better answer for the future, but the lack of hardware visibility and control in public cloud infrastructures is a real problem for high availability applications like VoltDB. Until cloud providers themselves offer a solution, users will have to make compromises – either by accepting the risks of undesired co-location of redundant resources or by paying the additional costs/latency of cloud infrastructure work-arounds.

The real solution is for public clouds to allow HA applications like VoltDB to detect when two VMs are co-located, and to allow explicit provisioning of N VMs on N servers. We don’t expect this to happen soon.

It seems like many users of the cloud and some vendors touting cloudy products are unaware of this problem or pretend it doesn’t exist. Do they know something we don’t? Does anyone have a better alternative?

John Hugg
Software Engineer
VoltDB

 


Comments: 4

  • Anonymous

    Great post if I may say so – thanks for summing up everything so eloquently. It is certainly high time that cloud providers provide us with such guarantees (albeit I fear the additional charges we may be presented with).

    I hope you don’t mind me citing your post in a recent post at Xeround’s blog – http://blog.xeround.com/2010/10/06/eggs-baskets-and-cloudy-guarantees/ – in which I explained how we address this problem in our service.

     
     
     
  • Anonymous

    It’s been a while since I’ve dug through the documentation, but while Amazon does not provide a guarantee that instances will be on different physical machines, they make a “best effort” attempt to place instances for one account on different physical machines. There is an academic paper that looked at this issue from a security perspective. They present ways to detect if two instances are on the same physical machine in EC2. This would permit you to use EC2 for building reliable services. Unfortunately, I think some of this has changed since the paper was published, but I’m sure you can still figure this out in some way:

    http://people.csail.mit.edu/tromer/papers/cloudsec.pdf

     
     
     
  • John Hugg

    Thanks! I see that Rackspace does provide this. I wonder how practical the brute force approach ends up being? This is definitely something we’ll look into at VoltDB.

     
     
     
  • Anonymous

    “I don’t think Rackspace gives you any more information than Amazon”
    It’s non-obvious, but there is a call you can make with the Rackspace API that returns a host-id hash for an instance you’ve started. A brute force approach is to keep trying to start instances until they end up on different hosts.

    “It would be AWESOME if they would let you provision N VMs on N servers”
    Agreed… very nice idea.
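
The brute-force approach described in this comment can be sketched roughly as follows. This is not Rackspace’s actual API: `launch` and `terminate` are hypothetical callables standing in for whatever your cloud client provides, with `launch` assumed to return an instance id together with a host identifier. `FakeCloud` is a toy stand-in so the sketch runs on its own.

```python
import random

def provision_on_distinct_hosts(n, launch, terminate, max_attempts=50):
    """Keep launching instances until n of them sit on n different hosts.

    launch()      -> (instance_id, host_id)   # hypothetical cloud call
    terminate(id) -> None                     # hypothetical cloud call
    Returns a dict mapping host_id -> instance_id.
    """
    kept = {}
    for _ in range(max_attempts):
        if len(kept) == n:
            break
        instance_id, host_id = launch()
        if host_id in kept:
            terminate(instance_id)  # co-located with one we already hold
        else:
            kept[host_id] = instance_id
    if len(kept) < n:
        # Give up cleanly rather than leave a half-built cluster running.
        for instance_id in kept.values():
            terminate(instance_id)
        raise RuntimeError(f"no {n} distinct hosts after {max_attempts} launches")
    return kept

class FakeCloud:
    """Toy cloud that drops each new VM on a random host (illustration only)."""
    def __init__(self, hosts, seed=0):
        self.hosts = hosts
        self.rng = random.Random(seed)
        self.running = {}  # instance_id -> host_id
        self.launched = 0

    def launch(self):
        self.launched += 1
        instance_id = f"vm{self.launched}"
        host_id = self.rng.choice(self.hosts)
        self.running[instance_id] = host_id
        return instance_id, host_id

    def terminate(self, instance_id):
        del self.running[instance_id]

cloud = FakeCloud(["h1", "h2", "h3", "h4"])
placement = provision_on_distinct_hosts(3, cloud.launch, cloud.terminate)
print(placement)
```

How practical this is depends on how the provider places VMs. On an hourly-billed cloud, every rejected launch may still cost you, and nothing in this sketch stops the provider from migrating VMs onto the same host later.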

     
     
     