Monday, April 14, 2008

EC2 Persistent Storage -- It's a crutch.

You must understand, I live locked in a small box, and I'm allowed only this keyboard, a 9" B&W monitor, and a mailbox that only a single person knows about. Thankfully, that person is Chris Herron, and he keeps sending me interesting things.

Specifically, a recent blog post about Amazon's new Persistent Storage service for EC2.

What it does is make high bandwidth, high granularity, permanent storage available to EC2 nodes.

One of the characteristics of EC2 is that your instance lives on a normal, everyday Intel machine with CPU, memory, and hard drive. (Actually this is most likely a VM running someplace, not a physical machine instance, but you never know.) But the model of the service is that while all of those capabilities are available to you, and the hard drive is indeed simply a hard drive, the machine it's all contained in can up and vanish at any time.

One minute everything is running along hunky dory, and the next your machine is dead.

Now most folks who do most simple things rarely lose an entire machine. They might lose access to the machine, like the network going down. They might lose power to the machine. Or an actual component on the machine may up and fail. But in all of these cases, for most scenarios, when the problem is resolved, the machine comes back to life effectively in the state it was in before the event. A long reboot, perhaps. Barring loss of the actual hard drive, these failures, while clearly affecting the availability of the machine (can't use it when you're having these problems), don't really affect the integrity of the machine. A hard drive failure is typically the worst simple failure a machine can have, and simply adding a duplicate drive and mirroring it (which every modern OS can do today) can help protect against that.

The difference with EC2 is that should anything occur to your "machine" in EC2, you lose the hard drive. "Anything" means literally that: anything. And there are a lot more "anythings" within EC2 than in a classic hosting environment. Specifically, since they don't promise your machine will be up for any particular length of time, you can pretty much be assured that it won't be. And truth is, it doesn't matter what that length of time is, whether it's one day, one week, one month or one year. Whenever the machine goes down, you effectively "lose all your work".

Anyone who has worked in a word processor for several hours only to have the machine restart on you can share in the joy of what it's like to lose "unsaved work". And that is what any data written to the hard drive of an EC2 instance is -- unsaved work. Work that has the lifespan of the machine. Please raise your hand if you'd like a year's worth of sales, order history, and customer-posted reviews to vanish in a heartbeat. Anyone? No, of course not.

Amazon's original solution was S3, their Simple Storage Service. This is a very coarse service: it works at the level of whole files, to the point that you can only replace an entire file rather than update a section of it. You only have simple, streaming read and write functions.
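To make that concrete, here's a rough sketch in Java of the kind of access S3 gives you: whole-object, streaming reads and writes over HTTP. The request signing is omitted and the bucket and key names are made up, but the shape is the point. There's no "update bytes 100 through 200 of this file", only "replace the whole thing".

    // A minimal sketch of S3-style access: whole-object streaming reads and
    // writes over HTTP. Request signing is omitted, and the bucket and key
    // names here are invented for illustration.
    import java.io.*;
    import java.net.*;

    public class S3Sketch {
        public static void main(String[] args) throws IOException {
            URL url = new URL("http://s3.amazonaws.com/my-bucket/orders.csv");

            // Write: you replace the ENTIRE object, even to change one line.
            HttpURLConnection put = (HttpURLConnection) url.openConnection();
            put.setDoOutput(true);
            put.setRequestMethod("PUT");
            OutputStream out = put.getOutputStream();
            out.write("the whole file, every time".getBytes());
            out.close();
            System.out.println("PUT status: " + put.getResponseCode());

            // Read: a simple stream of the whole object coming back.
            HttpURLConnection get = (HttpURLConnection) url.openConnection();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(get.getInputStream()));
            for (String line; (line = in.readLine()) != null; ) {
                System.out.println(line);
            }
            in.close();
        }
    }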

Next came SimpleDB, which Amazon offers as the next level of granularity. This allows small collections of attributes to be accessed individually. You can query, add, and delete the collections. Much better than S3, but it has its own issues. Specifically, its "eventual consistency" model. I bet most folks don't enjoy this characteristic of SimpleDB.

The new Persistent Storage service is what everyone has been looking for. Now they can go back to their old model of how computer systems are supposed to work, and they can host an RDBMS just like before. There was nothing stopping folks from running an RDBMS on any of the EC2 instances before, save that nagging "unsaved work" detail. Lose the instance, lose the RDBMS.

I can see why folks have been clamoring for this service, but frankly, I see it as a step backward from the basic premise of what Amazon offers: letting folks readily build scalable applications.

As noted earlier, today the most common bottleneck in most scalable systems IS the RDBMS. RDBMSs do not scale as easily as most other parts of an application. It's the reality of the distributed application problem. And Amazon's approach to addressing the problem with SimpleDB is, I think, admirable.

It's not the solution people want, however. They WANT a scalable RDBMS, and SimpleDB simply is not that beast. But scalable RDBMSs are very, very difficult. All of these kinds of systems have shortcomings that folks need to work around, and an RDBMS is no different. Amazingly, the shortcomings of a distributed RDBMS look much like what SimpleDB offers in terms of "eventual consistency", except the RDBMS will struggle to hide that synchronizing process, much like Google's Datastore does.

In the end, SimpleDB is designed the way it is, and is NOT an RDBMS, for a reason: to remain performant and scalable while working on a massively parallel infrastructure. I am fully confident that you can not "Slashdot" SimpleDB. This is going to be one difficult beast to take down. However, the price of that resiliency is its simplicity and its "eventual consistency" model.

There's a purity of thought when your tools limit you. One of the greatest strengths, and greatest curses, in the Java world is the apparently unlimited number of web frameworks and such available to developers. As a developer, when you run into some shortcoming or issue with one framework, it's easy and tempting, especially early on, to let the eye wander and try some other framework to find that magic bullet that will solve your problem and work the way you work.

But it can also be very distracting. If you're not careful, you find you spend all your time evaluating frameworks rather than actually getting Real Work done.

Now, if you were, say, an ASP.NET programmer, you wouldn't be wandering the streets looking for another solution. You'd simply make ASP.NET work. For various reasons, there are NOT a lot of web frameworks in common use in the .NET world. There, ASP.NET is the hammer of choice, so as a developer you use it to solve your problems.

Similarly, if SimpleDB were your only persistence choice, with all of its issues, then you as a developer would figure out clever ways to overcome its limitations and develop your applications, simply because you really had no alternative.

With the new attached Persistent Storage, folks will take a new look at SimpleDB. Some will still embrace it. Some will realize that, truly, it is the only data solution they should be using if they want a scalable solution. But others will go "you know, maybe I don't need that much scalability". SimpleDB is such a pain to work with compared to the warm, snuggly familiarity of an RDBMS that folks will up and abandon SimpleDB.

With that assertion, they'll be back to running their standard application designs for their limited domains. To be fair, the big benefit of the new storage is that generic, everyday applications can be ported to the EC2 infrastructure with mostly no changes.

The downside is that these applications won't scale, as they won't embrace the architecture elements that Amazon offers to enable their applications to scale.

I think the RDBMS option that folks are going to be giddy about is a false hope.

As I mentioned before, RDBMSs are typically, and most easily, scaled in a vertical fashion. In most any data center, the DB machine is the largest and most powerful. If there's a machine in the datacenter with multiple, redundant networks, power supplies, hard drives, CPUs, and so on, it's the DB machine. If there is any single machine that's designed to survive a common component failure and continue running, it's the DB machine.

For example, ten web machines talking to a single DB machine. Any one of those web machines can fail and the system "works". Kill the DB machine, and the rest become room heaters.

The crux here, though, is that all of those machines in the EC2 cloud are basically like those web machines. Cheap, plentiful, and unreliable. The beauty of having easy access to lots of machines is the ability to lose them and still run. However, if you plan on running a DB on one of these, just be aware that it's as fragile and unreliable as the rest. This machine WON'T have all the redundant features and such. It can come and go as quickly as the rest.

But for THE DB machine, that's not what you want. Lose a web server, eh, big deal. Lose the DB, BIG DEAL, the site is down and the phones are ringing.

Also, you're limited to the vertical scaling ability of their instances. Amazon offers larger nodes for hosting, with more CPUs, etc. You can vertically scale to a point with Amazon. But once you hit that boundary, you're stuck.

In Amazon land, if you want long term, reliable, and performant granular data storage, then you should be looking at SimpleDB, not the Persistent Storage. I think that if you want to build a scalable system on Amazon, you should build around the SimpleDB offering. The best use for the Persistent Storage and an RDBMS would be for things like a decision support database, something that is more powerful when it works with off the shelf tools (like ODBC) and that doesn't have the scaling needs of the normal application.

So, don't let the Persistent Storage offering fool you and distract you from the real problems if you want a scalable system.

Friday, April 11, 2008

GAE - Java?

At TheServerSide, there is an article about "Java is losing the battle for the modern web." It's chock full of bits and particles, at least indirectly, about issues with hosting web applications.

That's what GAE does. It hosts web applications. Currently, it hosts Python applications, but it can most likely host most anything, once they get a solid API for talking to the Datastore written for the environment.

Well, I should qualify that.

It can host most anything that works well with the idiom of "start process, handle request, stop process". As I mentioned before, GAE is running the CGI model of web application. This is counter to how Java applications run, however.

Java promotes executing virtual, deployable modules within a long running server process, typically a Java Servlet container. Most containers have mechanisms to support the loading, unloading, and reloading of these deployable modules. You can readily support several different web applications within a single Java Servlet container. As a platform, it's actually quite nice.

Java itself also promotes this kind of system. Java relies on long running processes to improve performance. Specifically, Java relies on "Just In Time" (JIT) compilers to translate JVM bytecode into native code. The magic of the JIT compiler is that it can observe the behavior of some Java code, and dynamically compile only the parts the JIT compiler feels are worthy of being compiled.

For example, say you have a class that has some non-trivial initialization code that runs when an instance is constructed, but it also has some compute intensive methods. If you only create one instance of that class, and execute the compute intensive methods, the JIT will convert those methods into native code, but will most likely not convert the logic in the constructor. It only runs once and isn't worth the expense of converting in order to make it run faster.
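To put a face on that, here's an invented example of the kind of class I mean (the class, names, and numbers are all arbitrary):

    // An invented example: distance() is called millions of times, so the
    // JIT compiles it to native code, while the one-shot table setup in the
    // constructor isn't worth compiling and stays interpreted.
    public class Geo {
        private final double[] table = new double[100000];

        public Geo() {
            // Non-trivial initialization, but it runs exactly once.
            for (int i = 0; i < table.length; i++) {
                table[i] = Math.sin(i / 1000.0);
            }
        }

        public double distance(double x, double y) {
            // The hot method: a prime JIT target.
            return Math.sqrt(x * x + y * y);
        }

        public static void main(String[] args) {
            Geo geo = new Geo();
            double sum = 0;
            for (int i = 0; i < 10000000; i++) {
                sum += geo.distance(i, i + 1);
            }
            System.out.println(sum);
        }
    }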

So, Java is the kind of system that gets faster the longer it runs. As the JIT observes things and gets some idle time, it incrementally makes the system faster via its compiler. Over time, more and more of the system is converted into native code. And, it so happens, very good native code.

The issue, however, is that in the case of something like GAE, a system such as Java is almost at complete odds with the environment that GAE promotes. GAE wants short processes, and lots of them, rather than large, single, long running processes.

So is Java completely out of the picture for something like GAE?

Actually, no I don't think so.

Just because Java LIKES long running processes doesn't mean it's unusable without them. For example, the bulk of the Java community uses the Ant tool to build their projects. That's a perfect example of a commonly used, short term process in Java. Even javac is a Java program.

Java is perfectly usable in a short term scenario. Java could readily be used for CGI programs. What CAN'T (at least now) be used for CGI programs are typical Java web applications. They're just not written for the CGI environment. They rely on long running processes: in-memory sessions, cached configuration information, Singleton variables. You most certainly wouldn't do something silly like launch Tomcat to process a single server request. That's just plain insane.
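But a plain CGI-style Java program is nothing exotic. Here's a sketch (the class name is mine; the environment variables are the standard CGI ones): read the request from the environment, write an HTTP response to stdout, and exit.

    // A bare-bones CGI-style Java program: one process per request. It reads
    // the standard CGI environment variables, writes a response to stdout,
    // and exits. No container, no long running process.
    public class HelloCgi {
        public static void main(String[] args) {
            String method = System.getenv("REQUEST_METHOD");
            String query  = System.getenv("QUERY_STRING");

            System.out.println("Content-Type: text/plain");
            System.out.println();
            System.out.println("Method: " + method);
            System.out.println("Query:  " + query);
        }
    }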

As a rule, Java tends to be "more expensive" to start than something like Perl or PHP. The primary reason is that most of Java is written in Java. Specifically the class library. So, in order to Do Anything, you need to load the JVM, and then start loading classes. Java loads over 280 classes just to print "Hello World" (mind you, these 280 classes come from only 3 files). All of that loading has some measure of overhead. I can well imagine that the code path between process start and "Hello World" is longer in Java than in, say, Perl. That code path is startup time.
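Don't take my word for it; the JVM will happily show you what it loads:

    // HelloWorld.java
    // Run it as:  java -verbose:class HelloWorld
    // and watch the "[Loaded ...]" lines scroll by before (and after) the
    // one line of output you actually asked for.
    public class HelloWorld {
        public static void main(String[] args) {
            System.out.println("Hello World");
        }
    }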

Of course, in modern web applications, startup time is almost irrelevant. Why? Because almost everyone embeds the scripting language into the web server. That's what mod_perl and mod_php do. They make the actual language interpreter and runtime first class citizens of the web server process. This is in distinction to starting a new process, loading the interpreter, loading and executing your code, and then quitting. Apache will pay the interpreter startup cost just once, when Apache starts. There may be some connection oriented initialization when Apache forks a new process to handle a connection, but those connection processes are long lived as well.

So, it turns out, when you're running your language interpreter within the web server, startup time is pretty much factored out of the equation. Unless it's unrealistically long, startup time is a non-issue with embedded runtimes.

Which raises the question: "Where is mod_java?" Why not embed Java? And that's a good question. I know it's been discussed in the past, but I don't know if there's a reasonable implementation of embedding the JVM within an Apache process.

What does mod_java need to do? The best case scenario would be for an embedded JVM to start up with the host Apache process. The JVM would then load in client classes on request, execute them, and return the result. The last thing the JVM would do is toss aside all of the code it just loaded. It would do that via a special ClassLoader responsible for loading everything outside of the core JVM configuration. This helps the JVM stay reasonably tidy from run to run.
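Here's a rough sketch of the heart of that idea. The paths, class name, and "run" method are invented, and a real mod_java would need far more care, but it shows the throwaway ClassLoader shape:

    import java.lang.reflect.Method;
    import java.net.URL;
    import java.net.URLClassLoader;

    // The heart of the hypothetical mod_java idea: load the application's
    // classes through a throwaway ClassLoader, run one request, and let the
    // loader (and everything it loaded) be garbage collected afterwards so
    // the core JVM stays tidy from run to run. The paths and class names
    // here are invented for illustration.
    public class RequestRunner {
        public static void handleRequest(String webAppDir, String className)
                throws Exception {
            URL[] classpath = { new URL("file:" + webAppDir + "/classes/") };
            URLClassLoader loader = new URLClassLoader(classpath);

            Class<?> appClass = loader.loadClass(className);
            Object app = appClass.newInstance();
            Method run = appClass.getMethod("run");
            run.invoke(app);

            // Once this method returns, nothing references the loader or its
            // classes, so the whole lot is eligible for collection.
        }
    }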

The cool thing about this is that the code that this JVM runs could readily use the Servlet API as its interface. The Servlet API has the concept of initializing Servlets, executing requests, and unloading Servlets. It also has the concept of persistent sessions that last from request to request. Obviously, most containers are long running, so those lifecycle methods are rarely invoked. Also, most folks consider sessions to be "in memory". Applications would need to adapt their behavior to assume that these lifecycle methods are called all the time, and that your sessions are going to be written to and read from persistent storage every single request. So, you'd want web applications that have fast servlet initialization times and that store little session data.

But those applications can still live under the purview of the standard Java Servlet API.
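A servlet written for that world doesn't look exotic. It just keeps init() cheap and treats the session as data that gets read and written around every request, rather than something that lives in memory. A sketch, assuming the hypothetical container handles the session persistence:

    import java.io.IOException;
    import javax.servlet.ServletException;
    import javax.servlet.http.*;

    // A sketch of a servlet written for the "everything is short lived"
    // model: init() stays cheap because it may run on every request, and the
    // session is treated as data the (hypothetical) container reads from and
    // writes back to persistent storage around each request.
    public class CounterServlet extends HttpServlet {
        @Override
        public void init() throws ServletException {
            // Nothing expensive here: no big config parse, no cache warm-up.
        }

        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            HttpSession session = req.getSession();
            Integer hits = (Integer) session.getAttribute("hits");
            hits = (hits == null) ? 1 : hits + 1;
            session.setAttribute("hits", hits); // persisted after this request

            resp.setContentType("text/plain");
            resp.getWriter().println("You have been here " + hits + " times.");
        }
    }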

That means that you could have mod_java, and CGI style web apps, with JSPs and everything.

Most of the standard web frameworks would be out the window, since most have long startup and configuration times. But if the idiom becomes popular, no doubt some CGI friendly frameworks would pop up, changing the dynamic of that one time configuration, perhaps being more lazy loading about it.

But would this kind of system perform?

Sure it would. In fact, it would likely perform better (in terms of pure application or script execution) than things like Perl or PHP. Why? Because Perl and PHP have to load and parse the script text every single request. Java just has to load bytecodes. Python has a similar pre-parsed form that can speed loading as well.

In this way, it turns out you can run Java safely, within a process model, you can run it quickly (though most likely not as quickly as a long running process), you get to use all of the bazillion lines of off the shelf library code, and even still use the fundamental Servlet API as well.

Things like Hibernate, EJB, and most of the web frameworks would not apply however, so it will be a different model of Java development. But it IS Java, and all of the advantages therein.

And if you want to instead run JRuby, Jython, Javascript, Groovy, or any other Java based scripting language, knock yourself out. In that case, it would be best to have mod_java perform some preload of those systems when Apache starts up, so they can be better ready to support scripting requests at request time with little spin up.

You would also have to limit the CGI Java processes from running things like threads, I would think. The goal is for the JVM to remain pristine after each request.

Google could readily incorporate such a "mod_java" into the GAE and make Java available to users. They can do this without having to reengineer the JVM.

There is one JVM change that would make this mod_java that much better, and that's the capability for the JVM to both grow dynamically in memory and also free memory back up to the OS. I know JRockit can dynamically grow; I do not know if it can dynamically shrink.

If the JVM could do that, then there's no reason for the "cheap hosts" to not provide this style of Java capability on their servers, as hosting Java becomes little different than hosting PHP.

And wouldn't that be exciting?

Thursday, April 10, 2008

Contrasting SimpleDB and GAE Datastore

Part and parcel of the infrastructures that Amazon and Google are promoting are their internal persistence systems.

Let's talk scaling for just a sec here. There are two basic ways that applications can be scaled: horizontal scaling and vertical scaling.

Horizontal scaling is spreading the application across several machines and using various load balancing techniques to spread application traffic across the different machines. If horizontal scaling is appropriate for your application, then if you want to support twice as much load, you can add twice as many machines.

Vertical scaling is using a bigger box to process the load. Here, you have only one instance of the application running, but it's running on a box with more CPUs, more memory, more of whatever was limiting you before. Today, a simple example would be moving from a single CPU machine to a dual CPU machine. Ideally the dual CPU machine will double your performance. (It won't, for a lot of reasons, but it can be close.)

Websites, especially static websites, are particularly well suited to horizontal deployments. If you've ever downloaded anything off the web where they either asked you to select a "mirror", or even automatically selected one for you, you've seen this process in action. You don't care WHICH machine you hit as long as it has what you're looking for. Mirroring of information isn't "transparent" to the user, but it's still a useful technique. There are other techniques that can make such mirroring or load balancing transparent to the user (for example, we all know that there is not a single machine servicing "www.google.com", but it all looks the same to us as consumers).

Vertical scaling tends to work well with conventional databases. In fact, vertical scaling works well for any application that relies upon locally stored information. Of course, in essence, that's all that a database is. But databases offer a capability that most applications rely upon, and that's a consistent view of the data that the database contains. Most applications enjoy the fact that if you change the value of a piece of data, when you read that data back it will have the changed value. And, as important, other applications that view that data will see the changed data as well. It's a handy feature to have. And with a single machine hosting the database, it's easy to achieve. But that consistency can really hamper scaling of the database, since the database is limited by machine size.

Let's look at a contrived example. Say you have a single database instance, and two applications talking to it. It seems pretty straightforward that when App A makes a change in the DB, App B would see it as well. Same data, same machine, etc. Simple. But you can also imagine that as you start adding more and more applications talking to that database instance, eventually it's simply going to run out of capacity to service them all. There will simply not be enough CPU cycles to meet the requests.

You can see that if the applications are web applications, as you horizontally scale the web instance, you add pressure to your database instance.

That's not really a bad thing; there are a lot of large machines that run large databases. But those large machines are expensive. You can buy a 1U machine with a CPU in it for less than $1000. You can buy 25 such machines for less than $25000. But you can't buy a single machine with 25 CPUs for $25000. They're a lot more. If you want to run on cheap hardware, then you need to go horizontal.

So, why not add more database instances?

Aye, there's the rub. Let's add another database instance. App A talks to DB A, and App B talks to DB B. A user hits App A and changes their name, and App A sends the update to DB A. But now that user's data doesn't match on DB B; it has the old data (stale data, as it were). How does DB B get synchronized with DB A? And, as important, WHEN does it get synchronized? And what if, instead of just two instances, you have 25 instances?

THAT is the $64 question. It turns out it's a Hard Problem. Big brainy types have been noodling this problem for a long time.

So, for many applications, the database tends to be the focal point of the scalability problem. Designers and engineers have worked out all sorts of mechanisms to get around the problem of keeping disparate sets of information synchronized.

Now, Amazon and Google are renting out their infrastructure with the goal of providing "instant" scalability. They've solved the horizontal scaling problem; they have a bazillion machines. Amazon will let you deploy to as many as you want, while Google hides that problem from you completely.

But how do they handle the data problem? How do they "fix" that bottleneck? Just because someone can quickly give you a hundred machines doesn't necessarily make solving the scalability issue easier. There's a bunch of hosts out there that will deploy a hundred servers for you.

Google and Amazon, however, offer their own data services to help take on this problem, and they're both unconventional for those who have been working with the ubiquitous Relational Database Systems of the past 30 years.

Both are similar in that they're flexible in their structure, and have custom query languages (i.e. not SQL).

Google's datastore is exposed to the Python programmer by tightly integrating the persistence layer with the Python object model. It's also feature rich in terms of offering different data types, allowing rows to have relationships to each other, etc. Google limits how you can query the data with predefined indexes. You as the developer can define your indexes however you want, but you will be limited to querying your data via those indexes. There's no real "ad hoc" query capability supported by the datastore. Also, the Google datastore is transactional in that you can send several changes to the datastore at once, and they will either all occur "at once", or none of them will occur.

Amazon's SimpleDB is more crude. Each database entry is a bag of multivalued attributes, all of which need to be string data. You as a developer are burdened with converting, say, numbers from string data into internal forms for processing, then converting them back into string values for storing. Also, Amazon doesn't allow any relationships among its data. Any relationships you want to make must be done in the application. Finally, SimpleDB is not a transactional system. There seems to be a promise that once the system accepts your change, it will commit the change, but you can't make several changes over time and consider them as a whole.
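To make that concrete, here's roughly what working with it feels like in Java. The SimpleDbClient here is an invented stand-in for whatever client library you end up using, but the string-only, attribute-bag shape of the data is SimpleDB's:

    import java.util.HashMap;
    import java.util.Map;

    // SimpleDbClient is an invented stand-in for whatever client library you
    // use; the point is the shape of the data. Every item is a bag of string
    // attributes, so numbers and dates get converted by hand, and any
    // "relationship" is just another string you manage yourself.
    public class OrderExample {
        public static void saveOrder(SimpleDbClient db) {
            Map<String, String> item = new HashMap<String, String>();
            item.put("customerId", "cust-42");            // a do-it-yourself foreign key
            item.put("total", Integer.toString(1995));    // numbers stored as strings
            item.put("placedAt", "2008-04-14T09:30:00Z"); // dates too
            db.putAttributes("orders", "order-1001", item);
        }

        public static int readTotal(SimpleDbClient db) {
            Map<String, String> item = db.getAttributes("orders", "order-1001");
            return Integer.parseInt(item.get("total"));   // and converted back by hand
        }
    }

    // The invented client interface, just to make the sketch self-contained.
    interface SimpleDbClient {
        void putAttributes(String domain, String itemName, Map<String, String> attrs);
        Map<String, String> getAttributes(String domain, String itemName);
    }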

Finally, there's one other crucial advertised difference between Amazon's and Google's systems. SimpleDB is designed to scale, and exposes that design to the developer. Google's is also, but it offers a different promise to the user.

See, Google appears to be promising consistency across the database. That's all well and good, but as you load down the database, that maintenance has costs. SimpleDB, on the other hand, and interestingly enough, does NOT guarantee consistency. Well, at least not immediately.

For example, you read data from the database, say that user record with the user name in it. You can update the data with the new name, and write it back to the database. If you then immediately read it back, you may well get the OLD record with the OLD name. In the example above, you just updated DB A, and read back the data from DB B.

Amazon guarantees that "eventually", your data will be consistent. Most likely in a few seconds.

Now, Google doesn't stipulate that limitation. The API says "update your data and the transaction commits or it doesn't". That implies when you write the data, it's going to be there when you read it back, that your new data will immediately be available.

Now, by punting on the integrity and consistency guarantee, Amazon is pushing some of the complexity of managing a distributed application back on to the developer.

In truth this is not such a bad thing. By exposing this capability, this limitation, you are forced as a developer to understand the ramifications of having "data in flight", so to speak, knowing that when you look at the datastore, it may be just a wee bit out of date. This capability will definitely turn your application design sideways.

In return though, you will have a scalable system, and know how it scales. Building applications around unreliable data on unreliable machines is what distributed computing is all about. That's why it's SO HARD. Two of the great fallacies of network computing are that the network is cheap and that it's reliable, when in fact it's neither. Yet many application developers consider the network as safe, because many idioms make the pain of the network transparent to them, giving the illusion of safety.

Amazon's SimpleDB doesn't. They basically guarantee "If you give us some data, we'll keep it safe and eventually you can get it back". That's it. If that "eventually" number is lower than the time between queries, then all looks good. But being aware that there IS a window of potential inconsistency is a key factor in application design.
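That awareness tends to show up in code as a small pattern: if you genuinely need to read back what you just wrote, you poll and back off rather than trust the first read to be current. A sketch, reusing the invented SimpleDbClient from the earlier example:

    import java.util.Map;

    // A hedged sketch of designing around the consistency window, reusing
    // the invented SimpleDbClient from the earlier example: if you really
    // must read back what you just wrote, poll with a backoff instead of
    // trusting the first read to be current.
    public class ReadYourWrite {
        public static Map<String, String> readAfterWrite(
                SimpleDbClient db, String domain, String itemName,
                String attribute, String expectedValue)
                throws InterruptedException {
            long waitMillis = 100;
            for (int attempt = 0; attempt < 5; attempt++) {
                Map<String, String> item = db.getAttributes(domain, itemName);
                if (item != null && expectedValue.equals(item.get(attribute))) {
                    return item; // the write has become visible
                }
                Thread.sleep(waitMillis); // still stale; wait and try again
                waitMillis *= 2;
            }
            // Still stale after a few seconds: the application has to decide
            // whether that actually matters (often it doesn't).
            return db.getAttributes(domain, itemName);
        }
    }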

Now, Google hides this side effect of the database implementation from you. But it does impose another limitation, which is basically that your transaction must take less than 5 seconds or it will be rolled back. To be fair, both systems have time limits on database actions, but what is key to the Google promise is that they can use that time window in order to synchronize the distributed data store. The dark side of the 5 second guarantee is not that your request will fail after 5 seconds, but that EVERY request can take UP TO 5 seconds to complete.

SimpleDB could have made a similar promise, that each commit may take up to 5 seconds, and used that 5 second window to synchronize your data store, but at the price of an expensive DB request. Instead, they return "immediately", with the assurance that at some point the data will be consistent, and meanwhile you can mosey on to other things. What's nice about this is that if it takes more than 5 seconds for the data to become consistent, you as a developer are not punished for it. With Google, your request is rejected if it takes too long. With Amazon, it, well, just takes too long. Whether it takes .1 seconds to get consistent or 1 minute, you as a developer have to deal with the potential discrepancy during application design.

Have you ever posted to Slashdot? At the end, after your post, it says "your post will not appear immediately". That's effectively the same promise that Amazon is making.

What it all boils down to is that the SimpleDB request is an asynchronous request (fire and forget, like sending an email), while the Google request is synchronous (click and wait, like loading a web page). They both do the same thing, but by exposing the call details, Amazon gives the developer a bit more flexibility along with more responsibility.

But here's the golden egg that's under this goose. Both solutions give you better options as a developer for persisting data than an off the shelf Relational Database, at least in terms of getting the application to scale. Recall that the database tends to be the bottleneck, that lone teller in the crowded bank who everyone in line is cursing.

For large, scaled systems, both of these systems handle a very hard problem and wrap it up in a nice, teeny API that will fit on an index card, and they give it to you on the cheap.

Wednesday, April 9, 2008

Application Deployment -- the new Models

As many have heard, Google has announced a new hosting service. They call it the Google App Engine.

Over on TheServerSide, they're comparing GAE and Amazon's EC2.

Meanwhile, recently on JavaLo^H^H^H^H^H^HDZone, there was lamenting about cheap Java hosting.

EC2 is very interesting. They managed to effectively out-grid Sun's Grid, at least for most people. EC2 makes commodity computing even more of a commodity. It rapidly eliminates more of the scaling barriers folks have. Sun's system is more geared to higher performance, short term, high CPU jobs, whereas EC2 is being used for actual hosting.

For $72/month, you end up with a pretty reliable, persistent system with EC2. You need to structure your application to leverage the Amazon infrastructure, but the payoff is high.

Beyond the infrastructure details, Amazon offers a pretty standard development model. The infrastructure does force some design constraints (specifically, the machine you're running on can be yanked out from underneath you at a moment's notice, so don't store anything Important on it), but once you tweak your dynamic persistence model, EC2 offers up a straightforward deployment environment.

But GAE is different. GAE has turned the model upside down and has, in fact, rewound the clock back to web development circa 1996.

Essentially, GAE promotes the Old School CGI model for web development. Specifically, it's embracing the classic old Unix process model for web development. This is contrary to much of the work today on threaded server architectures, notably all of the work done in Apache 2, the stock Java model of application development and deployment, and the new "Comet" application architecture.

See, threads live in a shared environment and are lighter weight in terms of CPU processing for switching between tasks. Simply put, a server can support more independent running threads than it can support independent running processes, and the time to switch between them is less. That means a threaded system can support more concurrent work, and will have a faster response time.

But the cost of threads is that you lose some safety. The threads all share a parent process. If one thread manages to somehow corrupt the parent process, then not only is the individual thread impacted, but so are all of the other threads sharing it.

You can see this in a Java server when a single thread allocates too much memory, or consumes too much CPU, and the only resolution is to restart the Java process. If that server were supporting several hundred other connections, all of those are reset and interrupted.
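The demonstration is depressingly small. This is a deliberately bad program (don't run it anywhere you care about): one runaway thread fills the heap that every other thread in the JVM shares, and after that any thread that tries to allocate gets an OutOfMemoryError. The only real fix is to restart the whole process.

    import java.util.ArrayList;
    import java.util.List;

    // Deliberately bad: one runaway thread exhausts the heap shared by the
    // whole JVM. The well-behaved worker threads then start failing on their
    // own small allocations, and the only real fix is a process restart.
    public class RunawayThread {
        // Held in a static field so the hoarded memory stays reachable.
        static final List<byte[]> hoard = new ArrayList<byte[]>();

        public static void main(String[] args) {
            // Pretend these are other users' requests being serviced.
            for (int i = 0; i < 10; i++) {
                new Thread(new Runnable() {
                    public void run() {
                        while (true) {
                            byte[] scratch = new byte[1024]; // modest, well-behaved work
                            try {
                                Thread.sleep(100);
                            } catch (InterruptedException e) {
                                return;
                            }
                        }
                    }
                }).start();
            }

            // The one badly behaved task.
            while (true) {
                hoard.add(new byte[10 * 1024 * 1024]); // eventually: OutOfMemoryError
            }
        }
    }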

With a process model, each process can be readily limited by the host operating system. They can have their resources easily curtailed (amount of memory, how much disk they use, total CPU time, etc.). When a process violates its contract with the OS, the OS kills it without a second thought. The benefit there is that your overall server is more stable, since the death of a process rarely affects other running processes.

But if threading is more efficient, why would Google go back to the process model? Can't threads be made safe?

Sure, threads can be made safe. If processes can be made safe, threads can be made safe. It's just a lot more work.

However, here is the crux of the matter. Threads make more efficient use of the host CPU. That's a given. But what if CPU scarcity is not a problem? What if conserving CPU resources is no longer a driver? What if overall response time, system stability and scalability are more important than CPU efficiency?

For Google, CPUs are "free". Granted, I imagine that Google spends more on powering and hosting CPUs than it does on payroll (I don't know, that's just a guess), but wrangling thousands of individual CPUs is a routine task at Google, and they have power to spare.

Using that model, here's how the GAE system turns the application hosting space on its head.

First, unlike Amazon, they're offering you an environment with a bit of disk space and some bandwidth. Your application can't serve up anything other than its own files, or anything else it can fetch over HTTP. Your application can not change the space that it's deployed in (they have the Datastore for that). Your application has hard runtime parameters placed upon it.

Also, and most importantly, your application has no promise as to WHERE it is being run. You have NO IDEA what machine any individual request will be executed upon. Every other hosting model out there is selling A Machine. GAE is selling you some CPU, somewhere, anywhere, and some bandwidth.

All of your actual applications are running within hard processes on some server somewhere, yet all of your data is served up from dedicated, and different, static data servers. This lets Google leverage the threading model for things like static resources (where it's very good), but use the process model for your applications (which is very safe).

What can Google do with this infrastructure? Simply put, it can scale indefinitely. Imagine a huge array of CPUs, all sharing a SAN hosting your application. Any of those CPUs can run your application. If you get one web hit, no problem. If you get Slashdotted, again, no problem. Since you're not on a single machine, your application WILL scale with demand. One machine, or a hundred machines, makes no difference.

As always, the hardest hit part will be the data store. But I have to imagine that Google has "solved" this problem as well, to a point. Their Datastore has to be as distributed as it can be. We'll have to wait and see more about how this works out.

Where does Java fit in this? Well, it doesn't. Unless you want to run CGI apps. Java web apps are basically long running processes, and those simply don't exist on the Google infrastructure.

Exciting times. I fully expect to see startups, OSS frameworks, and other papers written up on this infrastructure so it can be duplicated. There's no reason folks can not clone the GAE (or even Amazon) network APIs and host this kind of infrastructure in house. Long term, there will be no "vendor lock-in", I don't think.

Welcome to my Blog

Some folks have said "Hey Will, you should have a blog." For some reason, I seem to have listened to them and this is the result.

No presumptions, no promises, we'll see where this thing goes.

Follow along and join in if you're so inclined.