Monday, April 14, 2008

EC2 Persistent Storage -- It's a crutch.

You must understand, I live locked in a small box, and I'm allowed only this keyboard, a 9" B&W monitor, and a mailbox that only a single person knows about. Thankfully, that person is Chris Herron, and he keeps sending me interesting things.

Specifically, a recent blog post about Amazon's new Persistent Storage service for EC2.

What it does is make high bandwidth, high granularity, permanent storage available to EC2 nodes.

One of the characteristics of EC2 is that your instance lives on a normal, everyday Intel machine with CPU, memory, and hard drive. (Actually this is most likely a VM running someplace, not a physical machine instance, but you never know.) But the model of the service is that while all of those capabilities are available to you, and the hard drive is indeed simply a hard drive, the machine that it is all contained in can up and vanish at any time.

One minute everything is running along hunky dory, and the next your machine is dead.

Now, most folks who do most simple things rarely lose an entire machine. They might lose access to the machine, like the network going down. They might lose power to the machine. Or, an actual component on the machine may up and fail. But in all of these cases, for most scenarios, when the problem is resolved, the machine comes back to life effectively in the state it was in before the event. A long reboot perhaps. Barring loss of the actual hard drive, these failures, while clearly affecting the availability of the machine (can't use it when you're having these problems), don't really affect the integrity of the machine. A hard drive failure is typically the worst simple failure a machine can have, and simply adding a duplicate drive and mirroring it (which every modern OS can do today) can help protect from that.

The difference with EC2 is that should anything occur to your "machine" in EC2, you lose the hard drive. "Anything" means literally that, anything. And there are a lot more "anythings" within EC2 than in a classic hosting environment. Specifically, since they don't promise your machine will be up for any particular length of time, you can pretty much be assured that it won't be. And truth is, it doesn't matter what that length of time is, whether it's one day, one week, one month or one year. Whenever the machine goes down, you effectively "lose all your work".

Anyone who has worked in a word processor for several hours only to have the machine restart on you can share in the joy of what it's like to lose "unsaved work". And that is what any data written to the hard drive of an EC2 instance is -- unsaved work. Work that has the lifespan of the machine. Please raise your hand if you'd like a year's worth of sales, order history, and customer posted reviews to vanish in a heartbeat. Anyone? No, of course not.

Amazon's original solution was S3, their Simple Storage Service. This is a very coarse service: it works at the level of the whole file, to the point that you can only replace the entire file rather than update a section of it. You only have simple, streaming read and write functions.
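
To make that coarseness concrete, here is a minimal Python sketch. The class WholeObjectStore and the helper append_line are made up for illustration and are NOT the S3 API; they only mimic the "whole file in, whole file out" model described above, using an in-memory dictionary.

```python
class WholeObjectStore:
    """Toy in-memory stand-in for a whole-object store (hypothetical, not the S3 API)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        # A put always replaces the entire object; there is no partial write.
        self._objects[key] = bytes(data)

    def get(self, key):
        # A get always hands back the entire object.
        return self._objects[key]


def append_line(store, key, line):
    # "Updating" a section means: read everything, modify locally, write everything.
    current = store.get(key)
    store.put(key, current + line)


store = WholeObjectStore()
store.put("orders.log", b"order-1\n")
append_line(store, "orders.log", b"order-2\n")
print(store.get("orders.log"))
```

Even a one line append costs a full read and a full write of the object, which is why this model is a poor fit for anything that behaves like a database file.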

Next came SimpleDB, which Amazon offers as the next level of granularity. This allows small collections of attributes to be accessed individually. You can query, add, and delete the collections. Much better than S3, but it has its own issues. Specifically, its "eventual consistency" model. I bet most folks don't enjoy this characteristic of SimpleDB.
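
For anyone who hasn't been bitten by "eventual consistency" yet, here is a rough Python sketch of what it feels like from the application side. Everything in it (the EventuallyConsistentStore class, the two replicas, the explicit sync() call) is an assumption invented for illustration; it is not how SimpleDB is implemented and not its API. It just shows a write that has been accepted but is not yet visible everywhere.

```python
class EventuallyConsistentStore:
    """Toy model: writes land on one replica and reach the others only on sync()."""

    def __init__(self, replica_count=2):
        self._replicas = [{} for _ in range(replica_count)]
        self._pending = []  # writes not yet copied to every replica

    def put(self, key, value):
        # The write is acknowledged as soon as replica 0 has it.
        self._replicas[0][key] = value
        self._pending.append((key, value))

    def get(self, key, replica):
        # A read may be served by any replica, including a stale one.
        return self._replicas[replica].get(key)

    def sync(self):
        # Background replication, made explicit so the example is deterministic.
        for key, value in self._pending:
            for replica in self._replicas[1:]:
                replica[key] = value
        self._pending.clear()


store = EventuallyConsistentStore()
store.put("order-42", "paid")
stale = store.get("order-42", replica=1)  # None: replica 1 has not caught up
store.sync()
fresh = store.get("order-42", replica=1)  # "paid" once replication completes
print(stale, fresh)
```

The application just wrote "paid" and then read back nothing. Code written against an RDBMS rarely expects that, and that is the adjustment this model asks of a developer.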

The new Persistent Storage service is what everyone has been looking for. Now they can go back to their old model of how computer systems are supposed to work, and they can host an RDBMS just like before. There was nothing stopping folks from running an RDBMS on any of the EC2 instances before, save that nagging "unsaved work" detail. Lose the instance, lose the RDBMS.

I can see why folks have been clamoring for this service, but frankly, I see it as a step backward from the basic tenet that Amazon allows folks to readily build scalable applications.

As noted earlier, today the most common bottleneck in most scalable systems IS the RDBMS. RDBMSs do not easily scale like most other parts of an application. It's the reality of the distributed application problem. And Amazon's approach to addressing the problem with SimpleDB is, I think, admirable.

It's not the solution people want, however. They WANT a scalable RDBMS, and SimpleDB simply is not that beast. But scalable RDBMSs are very, very difficult. All of these kinds of systems have shortcomings that folks need to work around, and an RDBMS is no different. Amazingly, the shortcomings of distributed RDBMSs are much like what SimpleDB offers in terms of "eventual consistency", but the RDBMS will struggle to hide that synchronizing process, like Google's Datastore does.

In the end, SimpleDB is designed the way it is, and is NOT an RDBMS, for a reason, and that's in order to remain performant and scalable while working on a massively parallel infrastructure. I am fully confident that you cannot "Slashdot" SimpleDB. This is going to be one difficult beast to take down. However, the price of that resiliency is its simplicity and its "eventual consistency" model.

There's a purity of thought when your tools limit you. One of the greatest strengths, and greatest curses, of the Java world is the apparently unlimited number of web frameworks and such available to developers. As a developer, when you run into some shortcoming or issue with one framework, it's easy and tempting, especially early on, to let the eye wander and try some other framework to find that magic bullet that will solve your problem and work the way you work.

But it can also be very distracting. If you're not careful, you find you spend all your time evaluating frameworks rather than actually getting Real Work done.

Now, if you were, say, an ASP.NET programmer, you wouldn't be wandering the streets looking for another solution. You'd simply make ASP.NET work. For various reasons, there are NOT a lot of web frameworks in common use in the .NET world. There, ASP.NET is the hammer of choice, so as a developer you use it to solve your problems.

Similarly, if SimpleDB were your only persistence choice, with all of its issues, then you as a developer would figure out clever ways to overcome its limitations and develop your applications, simply because you really had no alternative.

With the new attached Persistent Storage, folks will take a new look at SimpleDB. Some will still embrace it. Some will realize that, truly, it is the only data solution they should be using if they want a scalable solution. But others will go "you know, maybe I don't need that much scalability". SimpleDB is such a pain to work with, compared to the warm, snuggly comfort of familiarity that an RDBMS is, that folks will up and abandon SimpleDB.

With that assertion, they'll be back to running their standard application designs for their limited domains. To be fair, the big benefit of the new storage is that generic, everyday applications can be ported to the EC2 infrastructure with mostly no changes.

The downside is that these applications won't scale, as they won't embrace the architecture elements that Amazon offers to enable their applications to scale.

I think the RDBMS option that folks are going to be giddy about is a false hope.

As I mentioned before, RDBMSs are typically, and most easily, scaled in a vertical fashion. In most any data center, the DB machine is the largest and most powerful. If there's a machine in the datacenter with multiple, redundant networks, power supplies, hard drives, CPUs, etc., if there is any single machine that's designed to survive a common component failure and continue running, it's the DB machine.

For example, picture ten web machines talking to a single DB machine. Any one of those web machines can fail and the system "works". Kill the DB machine, and the rest become room heaters.
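
A little back-of-the-envelope arithmetic shows how lopsided this is. The Python below is only a sketch under loud assumptions: the 0.99 per-machine availability is a number I made up for illustration, machines are assumed to fail independently, and the function name tier_availability is hypothetical.

```python
def tier_availability(machine_availability, machine_count):
    # The tier is up as long as at least one machine in it is up,
    # assuming machines fail independently of each other.
    return 1 - (1 - machine_availability) ** machine_count


P = 0.99  # assumed per-machine availability, illustrative only

web_tier = tier_availability(P, 10)  # ten interchangeable web machines
db_tier = tier_availability(P, 1)    # one DB machine, no spare
system = web_tier * db_tier          # the site needs both tiers up

print(f"web tier: {web_tier:.6f}, db: {db_tier:.6f}, system: {system:.6f}")
```

Under those assumptions the web tier is, for all practical purposes, always up, and the availability of the whole site collapses to the availability of the one DB machine.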

The crux here, tho, is all of those machines in the EC2 cloud are basically like those web machines. Cheap, plentiful, and unreliable. The beauty of having easy access to lots of machines is the ability to lose them and still run. However, if you plan on running a DB on one of these, just be aware that it's just as fragile and unreliable as the rest. This machine WON'T have all the redundant features and such. It can come and go as quickly as the rest.

But, being THE DB machine, means that's not what you want. Lose a web server, eh, big deal. Lose the DB, BIG DEAL, the site is down and the phones are ringing.

Also, you're limited to the vertical scaling ability of their instances. Amazon offers larger nodes for hosting, with more CPUs, etc. You can vertically scale to a point with Amazon. But once you hit that boundary, you're stuck.

In Amazon land, if you want long-term, reliable, and performant granular data storage, then you should be looking at SimpleDB, not the Persistent Storage. I think that if you want to build a scalable system on Amazon, you should work around the SimpleDB offering. The best use for the Persistent Storage and an RDBMS would be for things like a decision support database, something that is more powerful when it works with off the shelf tools (like ODBC) and that doesn't have the scaling needs of the normal application.

So, don't let the Persistent Storage offering fool you and distract you from the real problems if you want a scalable system.

2 comments:

Unknown said...

Will,

I partly agree with you, but I think that the EC2 DB scaling issue will be addressed sooner or later. To me, as a spectator, Amazon seems to be putting the pieces of the puzzle together bit by bit.

I think that the RDBMS shortcoming you bring up will be solved by 3rd Party services on top of EC2/S3, if not by Amazon itself.

For example, take a look at RightScale - they already support clustered MySQL on EC2 with failover (backed by S3 - bet that doesn't run too quick!). It sounds like they will support the persistent storage soon.

As for vertical scaling - that is something that Amazon has control over also. They can just increase the available power of the instances.

I don't think it's any coincidence that this was announced so soon after GAE. It'll be interesting to see what Google does for Java web apps.

Will Hartung said...

Of course. In theory if you want to run Oracle RAC on EC2, then the Persistent Storage will work fine. The Persistent Storage enables customers to use the current crop of off the shelf solutions used to get RDBMSs to scale.

My point is that RDBMSs have issues scaling for a reason. Also, to be most useful for an application, RDBMSs need to be very reliable.

As I mentioned, in most any modern deployment scenario, the single machine that gets the most attention to reliability is the database server.

But, with EC2, the RDBMS server is just like any other Joe in the rack. All of these instances are the same. Which means that if you wish to take advantage of an RDBMS within EC2, then you will need to take extra steps to work around the fact that the RDBMS machine is as unreliable as any other machine within your grid.

If you plan on deploying a single machine running your application and database, obviously this is not an issue at all.

Amazon can down your machine at any time, for any reason: hardware failure, hardware maintenance, load balancing, etc. Most applications treat their DB machine as being especially reliable, but EC2 does not.

Your DB machine is fair game, and as an architect, you simply need to be cognizant of that fact.

On the other hand, SimpleDB is "perfectly" reliable. It will "never" go down, and it will scale as large as you want to go if you're willing or able to work within its characteristic of "eventual consistency".

I just think that if you want to create massively scalable applications (and granted, not every application needs this, most, in fact, don't) on Amazon's infrastructure, then you should embrace SimpleDB and its design rather than relying on an RDBMS.

Having the RDBMS available is a good thing for Amazon, but I think it lets architects off the hook as far as properly designing their applications for the architecture that Amazon provides.