Wednesday, April 9, 2008

Application Deployment -- the new Models

As many have heard, Google has announced a new hosting service. They call it the Google App Engine.

Over on TheServerSide, they're comparing GAE and Amazon's EC2.

Meanwhile, recently on JavaLo^H^H^H^H^H^HDZone, there was some lamenting about cheap Java hosting.

EC2 is very interesting. They managed to effectively out-grid Sun's Grid, at least for most people. EC2 makes commodity computing even more of a commodity, and it rapidly knocks down the scaling barriers folks run into. Sun's system is geared more toward short-term, high-CPU, high-performance jobs, whereas EC2 is being used for actual hosting.

For $72/month, EC2 gives you a pretty reliable, persistent system: at the small-instance rate of about ten cents an instance-hour, a month of 24/7 uptime works out to roughly $72. You need to structure your application to leverage the Amazon infrastructure, but the payoff is high.

Beyond the infrastructure details, Amazon offers a pretty standard development model. The infrastructure does force some design constraints (specifically, the machine you're running on can be yanked out from underneath you at a moment's notice, so don't store anything important on it), but once you tweak your persistence model to account for that, EC2 offers up a straightforward deployment environment.
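In practice, that means anything worth keeping gets pushed off the instance as soon as possible, for example into S3. Here's a rough sketch using the boto library; the bucket name, key, and credentials are placeholders, not anything Amazon prescribes:

```python
# A rough sketch of persisting state off the EC2 instance into S3 using boto.
# The bucket name, key path, and credentials below are placeholders.
from boto.s3.connection import S3Connection
from boto.s3.key import Key

conn = S3Connection('ACCESS_KEY_ID', 'SECRET_ACCESS_KEY')
bucket = conn.get_bucket('my-app-state')        # hypothetical bucket

key = Key(bucket)
key.key = 'sessions/user-1234'
key.set_contents_from_string('whatever must outlive this instance')

# Later -- possibly on a completely different instance:
restored = bucket.get_key('sessions/user-1234').get_contents_as_string()
```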

But GAE is different. GAE has turned the model upside down and has, in fact, rewound the clock back to web development circa 1996.

Essentially, GAE promotes the old-school CGI model for web development; specifically, it embraces the classic Unix process model. This runs contrary to much of the work today on threaded server architectures: notably all of the work done in Apache 2, the stock Java model of application development and deployment, and the new "Comet" application architectures.
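For anyone who has forgotten what 1996 looked like, a CGI "application" is just a short-lived process: the web server spawns it for one request, it writes a response to stdout, and it exits. A minimal sketch (the query parameter is made up):

```python
#!/usr/bin/env python
# A classic CGI handler: the web server spawns this script as a fresh process
# for every single request; it writes one response and exits. No shared state.
import cgi
import os
import sys

form = cgi.FieldStorage()
name = form.getfirst("name", "world")   # hypothetical query parameter

sys.stdout.write("Content-Type: text/plain\r\n\r\n")
sys.stdout.write("Hello, %s -- served by pid %d\n" % (name, os.getpid()))
```

Hit it twice and the pid changes, because the process didn't survive the first request.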

See, threads live in a shared environment and are lighter weight in terms of the CPU cost of switching between tasks. Simply put, a server can support more independent running threads than it can independent running processes, and the time to switch between them is lower. That means a threaded system can handle more concurrent connections and will have a faster response time.
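If you want to feel the difference yourself, here's a rough, unscientific sketch (Unix-only; the numbers will vary wildly by machine) that times creating and reaping a batch of threads versus a batch of forked processes doing the same trivial work:

```python
# Rough micro-benchmark: cost of spinning up threads vs. forking processes.
import os
import threading
import time

def task():
    pass  # a no-op unit of work

N = 200

start = time.time()
threads = [threading.Thread(target=task) for _ in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("threads:   %.3fs" % (time.time() - start))

start = time.time()
pids = []
for _ in range(N):
    pid = os.fork()
    if pid == 0:        # child: do the work and exit immediately
        task()
        os._exit(0)
    pids.append(pid)    # parent: remember the child so it can be reaped
for pid in pids:
    os.waitpid(pid, 0)
print("processes: %.3fs" % (time.time() - start))
```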

But the cost of threads is that you lose some safety. The threads all share a parent process. If one thread manages to somehow corrupt that parent process, then not only is the individual thread impacted, but so are all of the other threads sharing it.

You can see this in a Java server when a single thread allocates too much memory or consumes too much CPU; often the only resolution is to restart the entire Java process. If that server was supporting several hundred other connections, all of those are reset and interrupted.

With a process model, each process can be readily limited by the host operating system. Its resources are easily curtailed (amount of memory, how much disk it uses, total CPU time, etc.), and when a process violates its contract with the OS, the OS kills it without a second thought. The benefit is that your overall server is more stable, since the death of one process rarely affects the other running processes.
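Here's a minimal sketch of that contract in action, assuming a Unix host: the parent forks a child, the child caps its own address space and then tries to blow past the cap, and only the child pays for it.

```python
# Sketch of the OS enforcing a per-process contract: the child gets a hard cap
# on its address space, tries to exceed it, and dies; the parent is untouched.
# Unix-only.
import os
import resource

pid = os.fork()
if pid == 0:
    # Child: cap our address space at ~256 MB, then try to grab ~1 GB.
    cap = 256 * 1024 * 1024
    resource.setrlimit(resource.RLIMIT_AS, (cap, cap))
    try:
        hog = ["x" * (1024 * 1024) for _ in range(1024)]
    except MemoryError:
        os._exit(1)   # the contract was violated; only this process suffers
    os._exit(0)
else:
    _, status = os.waitpid(pid, 0)
    print("child exit status: %d; parent is unaffected" % os.WEXITSTATUS(status))
```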

But if threading is more efficient, why would Google go back to the process model? Can't threads be made safe?

Sure, threads can be made safe. If processes can be made safe, threads can be made safe. It's just a lot more work.

However, here is the crux of the matter. Threads make more efficient use of the host CPU. That's a given. But what if CPU scarcity is not a problem? What if conserving CPU resources is no longer a driver? What if overall response time, system stability and scalability are more important than CPU efficiency?

For Google, CPUs are "free". Granted, I imagine that Google spends more on powering and hosting CPUs than it does on payroll (I don't know, that's just a guess), but wrangling thousands of individual CPUs is a routine task at Google, and they have power to spare.

Using that model, here's how the GAE system turns the application hosting space on its head.

First, unlike Amazon, they're offering you an environment with a bit of disk space and some bandwidth. Your application can't serve up anything other than its own files, or anything it can fetch over HTTP. Your application cannot change the space it's deployed into (they have the Datastore for that). Your application has hard runtime limits placed upon it.
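To make that concrete, here's roughly what a handler looks like under GAE's original Python SDK. The model class and its field are made up for illustration, but the webapp framework and Datastore API are the stock ones:

```python
# A minimal sketch in the style of the original GAE Python SDK: the handler
# can't write to local disk, so anything it wants to keep goes to the Datastore.
# The "Note" model and its field are hypothetical.
from google.appengine.ext import db, webapp
from google.appengine.ext.webapp.util import run_wsgi_app

class Note(db.Model):
    content = db.StringProperty()

class MainPage(webapp.RequestHandler):
    def get(self):
        # No writable filesystem here; persistent state lives in the Datastore.
        note = Note(content="hello from whichever machine ran this request")
        note.put()
        self.response.out.write("stored %d notes so far" % Note.all().count())

application = webapp.WSGIApplication([('/', MainPage)])

if __name__ == '__main__':
    run_wsgi_app(application)
```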

Also, and most importantly, your application has no promise as to WHERE it is being run. You have NO IDEA what machine any individual request will be executed upon. Every other hosting model out there is selling A Machine. GAE is selling you some CPU, somewhere, anywhere, and some bandwidth.

All of your actual application code runs in tightly constrained processes on some server somewhere, yet all of your static content is served up from dedicated, and different, servers. This lets Google leverage the threading model for static resources (where it's very good) and the process model for your applications (where it's very safe).

What can Google do with this infrastructure? Simply put, it can scale indefinitely. Imagine a huge array of CPUs, all sharing a SAN hosting your application. Any of those CPUs can run your application. If you get one web hit, no problem. If you get Slashdotted, again, no problem. Since you're not on a single machine, your application WILL scale with demand. One machine, or a hundred machines, makes no difference.

As always, the hardest-hit part will be the data store. But I have to imagine that Google has "solved" this problem as well, to a point. Their Datastore has to be about as distributed as it can be. We'll have to wait and see how this works out.

Where does Java fit in this? Well, it doesn't, unless you want to run CGI apps. Java web apps are basically long-running processes, and those simply don't exist on the Google infrastructure.

Exciting times. I fully expect to see startups, OSS frameworks, and papers built around this infrastructure so it can be duplicated. There's no reason folks can't clone the GAE (or even Amazon) network APIs and host this kind of infrastructure in house. Long term, I don't think there will be any "vendor lock-in."
