Thursday, April 10, 2008

Contrasting SimpleDB and GAE Datastore

Part and parcel of the infrastructures that Amazon and Google are promoting are their internal persistence systems.

Let's talk scaling for just a sec here. There are two basic ways that applications can be scaled: horizontal scaling and vertical scaling.

Horizontal scaling is spreading the application across several machines and using various load balancing techniques to spread application traffic across them. If horizontal scaling is appropriate for your application, then to support twice as much load you can add twice as many machines.

Vertical scaling is using a bigger box to process the load. Here, you have only one instance of the application running, but it's running on a box with more CPUs, more memory, more of whatever was limiting you before. Today, a simple example would be moving from a single-CPU machine to a dual-CPU machine. Ideally the dual-CPU machine will double your performance. (It won't, for a lot of reasons, but it can come close.)

Websites, especially static websites, are particularly well suited to horizontal deployments. If you've ever downloaded anything off the web where the site either asked you to select a "mirror" or automatically selected one for you, you've seen this process in action. You don't care WHICH machine you hit as long as it has what you're looking for. Mirroring of information isn't "transparent" to the user, but it's still a useful technique. There are other techniques that can make such mirroring or load balancing transparent to the user (for example, we all know that there is not a single machine servicing "www.google.com", but it all looks the same to us as consumers).

Vertical scaling tends to work well with conventional databases. In fact, vertical scaling works well for any application that relies upon locally stored information. Of course, in essence, that's all that a database is. But databases offer a capability that most applications rely upon, and that's a consistent view of the data that the database contains. Most applications enjoy the fact that if you change the value of a piece of data, when you read that data back it will have the changed value. And, as important, other applications that view that data will see the changed data as well. It's a handy feature to have. And with a single machine hosting the database, it's easy to achieve. But that consistency can really hamper scaling of the database, since the database is limited by the size of the machine it runs on.

Let's look at a contrived example. Say you have a single database instance and two applications talking to it. It seems pretty straightforward that when App A makes a change to the DB, App B will see it as well. Same data, same machine, etc. Simple. But you can also imagine that as you add more and more applications talking to that database instance, eventually it will simply run out of capacity to service them all. There will simply not be enough CPU cycles to meet the requests.

You can see that if the applications are web applications, as you horizontally scale the web instances, you add pressure to your single database instance.

That's not really a bad thing; there are a lot of large machines running large databases. But those large machines are expensive. You can buy a 1U machine with a CPU in it for less than $1,000. You can buy 25 such machines for less than $25,000. But you can't buy a single machine with 25 CPUs for $25,000; it costs a lot more. If you want to run on cheap hardware, then you need to go horizontal.

So, why not add more database instances?

Aye, there's the rub. Let's add another database instance. App A talks to DB A, and App B talks to DB B. A user hits App A and changes their name, and App A sends the update to DB A. But now that user's data doesn't match on DB B; it has the old data (stale data, as it were). How does DB B get synchronized with DB A? And, as important, WHEN does it get synchronized? And what if, instead of just two instances, you have 25?

THAT is the $64 question. It turns out it's a Hard Problem. Big brainy types have been noodling this problem for a long time.

So, for many applications, the database tends to be the focal point of the scalability problem. Designers and engineers have worked out all sorts of mechanisms to get around the problem of keeping disparate sets of information synchronized.

Now, Amazon and Google are renting out their infrastructure with the goal of providing "instant" scalability. They've solved the horizontal scaling problem: they have a bazillion machines. Amazon will let you deploy to as many as you want, while Google hides that problem from you completely.

But how do they handle the data problem? How do they "fix" that bottleneck? Just because someone can quickly give you a hundred machines doesn't necessarily make solving the scalability issue easier. There are a bunch of hosts out there that will deploy a hundred servers for you.

Google and Amazon, however, offer their own data services to help take on this problem, and they're both unconventional for those who have been working with the ubiquitous Relational Database Systems of the past 30 years.

Both are similar in that they're flexible in their structure, and have custom query languages (i.e. not SQL).

Google's datastore is exposed to the Python programmer by tightly integrating the persistence layer with the Python object model. It's also feature rich in terms of offering different data types, allowing rows to have relationships to each other, etc. Google limits how you can query the data with predefined indexes. You as the developer can define your indexes however you want, but you will be limited to querying your data via those indexes; there's no real "ad hoc" query capability in the datastore. Also, the Google datastore is transactional, in that you can send several changes to the datastore at once, and they will either all occur "at once" or none of them will occur.
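
To make that concrete, here's a rough sketch of what that Python integration looks like, using the google.appengine.ext.db module. The Person model and its properties are made-up examples of mine, not anything Google ships; the point is that persistence, relationships, and queries all hang off the Python class itself.

    from google.appengine.ext import db

    # A made-up model: persistence is declared right on the Python class.
    class Person(db.Model):
        name = db.StringProperty(required=True)
        age = db.IntegerProperty()
        boss = db.SelfReferenceProperty()   # relationships between entities

    # Store an entity; put() returns its key.
    alice_key = Person(name='Alice', age=42).put()

    # Queries run against indexes on the properties, not ad hoc SQL.
    grownups = Person.all().filter('age >', 17).order('age').fetch(10)

    # The same query in GQL, the datastore's SQL-ish query language.
    same_thing = db.GqlQuery("SELECT * FROM Person WHERE age > :1", 17)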

Amazon's SimpleDB is more crude. Each database entry is a bag of multivalued attributes, all of which need to be string data. You as a developer are burdened with converting, say, numbers from string data into internal forms for processing, then converting them back into string values for storing. Also, Amazon doesn't allow any relationships among its data; any relationships you want must be maintained in the application. Finally, SimpleDB is not a transactional system. There seems to be a promise that once the system accepts your change, it will commit the change, but you can't make several changes over time and consider them as a whole.
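
To see what that burden looks like in practice, here's a minimal sketch in Python. The dict standing in for a SimpleDB domain and the zero-padding width are my own assumptions; the point is just that the string conversions live in your application code, not in the database.

    def encode_int(n, width=10):
        # Everything in SimpleDB is a string and comparisons are lexicographic,
        # so numbers are commonly zero-padded so string order matches numeric order.
        return str(n).zfill(width)

    def decode_int(s):
        return int(s)

    def save_user(domain, user_id, name, age):
        # 'domain' is just a dict standing in for a SimpleDB domain here;
        # a real client call would go in its place.
        domain[user_id] = {'name': name, 'age': encode_int(age)}

    def load_user(domain, user_id):
        attrs = domain[user_id]
        return {'name': attrs['name'], 'age': decode_int(attrs['age'])}

    users = {}
    save_user(users, 'user-1', 'Alice', 42)
    print(load_user(users, 'user-1'))    # {'name': 'Alice', 'age': 42}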

Finally, there's one other crucial advertised difference between Amazon's and Google's systems. SimpleDB is designed to scale and exposes that design to the developer. Google's is too, but it offers a different promise to the user.

See, Google appears to be promising consistency across the database. That's all well and good, but as you load down the database, maintaining that consistency has costs. SimpleDB, on the other hand, and interestingly enough, does NOT guarantee consistency. Well, at least not immediately.

For example, say you read a record from the database, that user record with the user name in it. You update it with the new name and write it back to the database. If you then immediately read it back, you may well get the OLD record with the OLD name. In the example above, you just updated DB A and read the data back from DB B.

Amazon guarantees that "eventually", your data will be consistent. Most likely in a few seconds.
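
One blunt way to live with that, sketched below, is to poll for a short while when you really do need to read your own write back. The timeout, the polling interval, and the 'version' attribute are all my own assumptions; often the better answer is to design the application so it doesn't need to read the write back at all.

    import time

    def wait_for_write(read_fn, expected_version, timeout=5.0, interval=0.2):
        # Poll until the store reflects our write, or give up when the
        # window closes and treat the write as still "in flight".
        deadline = time.time() + timeout
        while time.time() < deadline:
            record = read_fn()
            if record and record.get('version') == expected_version:
                return record        # the write is now visible
            time.sleep(interval)     # still seeing the stale copy
        return None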

Now, Google doesn't stipulate that limitation. The API says "update your data and the transaction commits or it doesn't". That implies that when you write the data, it's going to be there when you read it back, that your new data will be immediately available.

Now, by punting on the integrity and consistency guarantee, Amazon is pushing some of the complexity of managing a distributed application back onto the developer.

In truth this is not such a bad thing. By exposing this capability, this limitation, you are forced as a developer to understand the ramifications of having "data in flight", so to speak, knowing that when you look at the datastore, it may be just a wee bit out of date. This will definitely turn your application design sideways.

In return, though, you will have a scalable system, and you will know how it scales. Building applications around unreliable data on unreliable machines is what distributed computing is all about. That's why it's SO HARD. Two of the great fallacies of network computing are that the network is cheap and that it's reliable, when in fact it's neither. Yet many application developers treat the network as safe, because many idioms make the pain of the network transparent to them, giving the illusion of safety.

Amazon's SimpleDB doesn't offer that illusion. The guarantee is basically "If you give us some data, we'll keep it safe and eventually you can get it back." That's it. If that "eventually" number is lower than the time between queries, then all looks good. But being aware that there IS a window of potential inconsistency is a key factor in application design.

Now, Google hides this side effect of the database implementation from you. But it does impose another limitation, which is basically that your transaction must take less than 5 seconds or it will be rolled back. To be fair, both systems have time limits on database actions, but what is key to the Google promise is that they can use that time window to synchronize the distributed data store. The dark side of the 5 second guarantee is not that your request will fail after 5 seconds, but that EVERY request can take UP TO 5 seconds to complete.
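
Here's a rough sketch of what that looks like in the Python API, using db.run_in_transaction. The Counter model is made up, and the 5-second figure is the limit as described above, not something the code itself enforces.

    from google.appengine.ext import db

    class Counter(db.Model):
        count = db.IntegerProperty(default=0)

    def bump(counter_key):
        # A read-modify-write applied atomically. If it doesn't finish
        # inside the time limit, the whole transaction is rolled back and
        # an exception is raised, so the caller has to be ready to retry.
        counter = db.get(counter_key)
        counter.count += 1
        counter.put()

    key = Counter().put()
    db.run_in_transaction(bump, key)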

SimpleDB could have made a similar promise (each commit may take up to 5 seconds) and used that window to synchronize your data store, but at the price of an expensive DB request. Instead, it returns "immediately", with the assurance that at some point the data will be consistent; meanwhile you can mosey on to other things. What's nice about this is that if it takes more than 5 seconds for the data to become consistent, you as a developer are not punished for it. With Google, your request is rejected if it takes too long. With Amazon, it, well, just takes too long. Whether it takes 0.1 seconds or 1 minute to get consistent, you as a developer have to deal with the potential discrepancy during application design.

Have you ever posted to Slashdot? At the end, after your post, it says "your post will not appear immediately". That's effectively the same promise that Amazon is making.

What it all boils down to is that the SimpleDB request is asynchronous (fire and forget, like sending an email), while the Google request is synchronous (click and wait, like loading a web page). They both do the same thing, but by exposing the call details, Amazon gives the developer a bit more flexibility along with more responsibility.
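
Here's a toy sketch of that difference in call shape. It's not how either service is implemented, just an illustration of fire-and-forget versus click-and-wait.

    import threading
    import time

    store = {}        # stands in for the datastore
    pending = []      # writes accepted but not yet applied

    def sync_write(key, value):
        # "Click and wait": returns only once the change is visible to readers.
        store[key] = value

    def async_write(key, value):
        # "Fire and forget": returns immediately; a background worker
        # applies the change some time later.
        pending.append((key, value))

    def apply_pending_writes():
        while True:
            while pending:
                key, value = pending.pop(0)
                time.sleep(0.5)              # the "eventually" window
                store[key] = value
            time.sleep(0.1)

    worker = threading.Thread(target=apply_pending_writes)
    worker.daemon = True
    worker.start()

    async_write('user-1', 'Alice')
    print(store.get('user-1'))    # probably None: the write hasn't landed yet
    time.sleep(1.0)
    print(store.get('user-1'))    # 'Alice': eventually consistent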

But here's the golden egg that's under this goose. Both solutions give you better options as a developer for persisting data than an off-the-shelf Relational Database, at least in terms of getting the application to scale. Recall that the database tends to be the bottleneck, that lone teller in the crowded bank whom everyone in line is cursing.

For large, scaled systems, both of these offerings take a very hard problem, wrap it up in a nice, teeny API that will fit on an index card, and give it to you on the cheap.
