Tuesday, July 1, 2008

VMs getting scared?

Over on java.net, they're commenting on whether Google has chosen poorly by backing RIAs via JavaScript/CSS/DOM/HTML rather than the VM model of Java/Flash/Silverlight.

I look at it from an opposite point of view. Originally I was going to post this as a comment on their site, but for some reason they wouldn't let me.

But I have my OWN voice! Squeak in the wind it may be, I shall NOT BE SILENCED by whimsical corporate overlords. Or, something like that -- geez...not changing the world here.

Anyway.

You can argue that Google and Apple soldiering on with the web platform is akin to the early pioneers who did the same with the Java platform, back when Java WAS slow, had platform issues, and other teething problems.

If it weren't for the likes of GMail and other RIA browser apps, the browser makers would have less incentive to push the browser as a platform. Yet, now we see that, while not perfect, the browser as RIA runtime is viable for a large class of applications, and it's just getting better.

Witness the improvements to the runtimes, via both Adobe's new JavaScript engine and Apple's, plus the new version of JS as a language. We also have DOM changes like the CANVAS tag to handle graphics, as well as improved SVG support.

All of these changes drive the platform to become more flexible and more performant, in order to handle more advanced applications.

Is it perfect? No, of course not. If you want something more robust and fluid than what a browser RIA can provide today, then by all means go the VM route. But there are a lot of valid reasons to stay out of the VM.

VMs add more overhead to an already big system. You still need the browser to launch the application, and when you load that browser, you pretty much get the entire runtime as well. Heck, you can barely launch Flash today properly without JavaScript. So now you pay for both runtimes.

Of course, there's Apple's iPhone, which supports neither Java nor Flash, but it DOES have a full-boat Safari implementation. So, GMail yay, Flex/FX/Silverlight nay.

Finally, you simply have the fragmentation effect: Flash, Java, and Silverlight cut up the developer pie, while JS/HTML remains a cohesive and reasonably standard cross-platform solution.

The number of applications that the browser runtime can support is expanding with every release of the various browsers. The momentum is for browser makers to provide a robust application environment while at the same time improving their unique UI elements for those standard browser tasks. You cannot have a successful browser today that doesn't handle large JS RIA applications.

The browser. It's not just for surfing any more.

Friday, June 27, 2008

Party like it's 1992

Back when I started this whole software thing, the primary interface for users was what was known as the "smart terminal". Glass screen, keyboard, serial port with a DB-25 connector stuck in the back. They had sophisticated elements like bold, blink, and underlined text, as well as line graphics. A rainbow of colors: white, green, and amber. VT100s and Wyse 50s were the most popular and representative of the lot, but anyone who has ever looked at a modern termcap file can see that there were hundreds of these things floating around from every manufacturer.

While the different terminals and their capabilities were novel (I had one that had TWO, count 'em, serial ports making it easy to flip back and forth between sessions), what's more relevant at this point is the idiom of computing they represented.

At the time, it was called time sharing. Monolithic computers sharing time across several different applications and users. Usage was often billed by the CPU time consumed, and computer utilization was a big deal because of the cost of the main computer. Batch processing was more popular because it gave operators better control over that utilization, and ensured that the computer was used to capacity. Idle computer time is wasted computer time, and wasted computer time is unbilled computer time.

As computers became cheaper, they became more interactive, since it became more affordable to let a computer sit idle waiting for the user's next action.

The primary power, though, was that you had this large central computer that could be shared. Sally on one terminal could enter data that Bob could then see on his terminal, because all of the data lived on the single, central computer.

The users interacted with the system; it would think, crunch, and grind, and then spit out the results on the screen. Dumb Terminals would just scroll, but Smart Terminals had addressable cursors, function keys, and some even had crude form languages. You could download a simple form spec showing static text along with fields the user could fill in. The computer sends the codes, the terminal paints the form, the user interacts, locally, with the form, then ships back the whole kit with a single SEND key. This differs from how many folks today are used to interacting with terminals, especially if they're only used to something like vi on the Linux command line, where you send a key and the computer responds directly.

Now, this was all the late '80s and early '90s. During that time, the world shifted a bit with the widespread introduction of the PC. Now folks were using PCs for personal work, and maybe running "Green Screen" apps through a terminal emulator.

It didn't take long for people to want to leverage those PCs for shared work, and with the aid of networking and database servers, the era of Client Server Applications was born. Large applications, locally installed and running on individual computers, while the central computer was relegated to serving up data from its database. Visual Basic, PowerBuilder, and a plethora of other Client Server "4GLs" took the market by storm, and every back office application coder was pointing and clicking their way to GUI glory.

Of course C/S programming is still with us today, "fat apps" we call them now. The 4GLs have kind of lost their luster. Make no mistake, there are a zillion lines of VB and Java being written today for back office applications, but the more specialized tools are no longer as popular as they once were. The general purpose tools seem to be doing the job nicely.

However, the Buzz of application development hasn't been with the Fat C/S app. Fat apps have all sorts of deployment, compatibility, resource, and portability issues. Having to roll an update of a C/S application out to 1000 users is a headache for everyone involved.

No, today it's the web. Everything on the web. Universal access, pretty GUIs, fast deployments, centralized control. We all know how the web works, right?

Sure. The computer sends the codes, the browser paints the form, the user interacts, locally, with the form, then ships back the whole kit with a single SUBMIT key.

Where have we heard that before? Central computer, bunch of "smart clients". Of course we use TCP/IP today instead of RS-232, and the clients are much more interesting. The browser is vastly more capable and offers higher bandwidth interfaces than a Green Screen ever could. But, the principle is pretty much the same. Everything on the central computer, even if the "central computer" is now several computers and a bunch of networking gear.

If you've been paying attention, you may have noticed over the past couple of years that the browser is getting much, much smarter.

It's not that this is new; it's been happening for quite some time. More and more Web Applications are pushing more and more logic down into the browser. AJAX is the buzzword, and GMail is the poster child.

The JavaScript engines within the browsers, along with other resources, are going to change how you, as an application developer, develop your back office applications. Not today, not this second, but it's always good to look ahead and down the road.

First, we have good old Microsoft. Who'd have thought the thing that threatens Microsoft the most comes from within Microsoft Labs itself. Yes, this is all their fault.

Microsoft, in their brilliance and genius, gave the world the ubiquitous "XHR", the XmlHttpRequest. XmlHttpRequest is the little nugget of logic that enables browsers to easily talk back to the server through some mechanism other than the user clicking a link or a submit button. There were other "hacks" that offered similar abilities, but they were, well, hacks. Ungainly and difficult to use. But XmlHttpRequest, that's easy.

From the introduction of XHR, we get the rise of the AJAX libraries. Things like Prototype and jQuery. JavaScript libraries that, mostly, give ready access to the HTML DOM within the browser, but also provide ready access to servers via XHR. There's a lot you can do with JavaScript and DOM tricks, and pretty animations, and whatnot. But it gets a lot more interesting when you can talk to a server as a dynamic data source. So, while DOM wrangling was fun and flashy for menus and the like, XHR is why more folks are interested in it today than before.

These first generation JS libraries provide a foundation that folks can start to build upon. They basically open up the primitives and elements used to construct browser pages. And once you open that up, folks, being programmers, are going to want to make that easier to use.

Enter the next generation. JavaScript Component Libraries. Dojo, YUI, ExtJS. From primitive DOM and node shuffling to high level widgets. Widgets that start to work across the different browsers (cross browser compatibility always bringing tears of joy to any coder who has had to deal with it...well, tears at least).

With a Widget Library, you start getting into the world that Windows and Mac OS coders started with. You end up with a clean slate of a page, an event model of some kind to handle the keyboard and mouse, and high level widgets that you can place at will upon that clean slate, to do a lot of the heavy lifting. On top of that, you have simple network connectivity.

This is where we were as an industry in the late '80s and early '90s. This kind of technology was becoming cheap, available, and commonplace.

And what did we get when this happened in the '90s? The 4GLs. VB, PowerBuilder, etc. Higher level language systems that made combining widgets and data together easier for everyday developers.

That's the 3rd generation for the browser. That's where we are today. On the one hand, you have the server centric component frameworks, like JSF, .NET, Wicket, etc. Really, these aren't yet quite as rich as what the modern browser can provide. They have "AJAX Components", but in truth the developers are still coding up forms with dynamic bits that use JavaScript rather than JavaScript Applications that run essentially purely in the browser.

There's GWT, a clever system that lets you write your client code in Java and download it into a browser to run, after the toolkit compiles the Java into JavaScript. Here, you can create "fat" JavaScript applications.

But, also, we have the recent announcements of Apple's work with SproutCore, as well as the 280 North folks with their "Objective-J". These are not simply widget frameworks. They're entire programming systems where the premise is that the browser is the runtime environment, while the server is simply there for data services. Classic Client/Server computing circa 1992.

Of course today, the server protocol is different. Back then we were shoving SQL up and getting data back. Today, nobody in their right mind is pushing SQL up from the client (well, somebody is, but there's that "right mind" qualifier). Rather, they're talking some kind of higher level web service API (REST, SOAP, POX, JSON, it's not really important which). Today we have App servers that are more powerful, complicated, and robust than DBMSs and Stored Procedures.

Take note of the work of Apple and Mozilla with their efforts to speed up and improve JavaScript. Because like it or not, JavaScript is becoming the lingua franca of the modern "Fat App". The language is getting better, meeting the desires of the dynamic language wonks, as well as the "programming in the large" folks, with better modularization giving us more flexibility and expressibility. JavaScript is also getting faster, and the modern browsers are gearing up to be able to download 1MB of JS source code, compile it, and execute it efficiently, and for a long period of time (which means fast code, good garbage collectors, etc.).

You'll note that this work isn't in the area of Flash or Silverlight. Programming Flash or Silverlight is no different than programming Java Applets. You create some compiled application that's downloaded to an installed runtime on the client's computer. By promoting JavaScript and the HTML DOM, even though more effort is being made to hide that from the day to day coder, Apple and Mozilla are promoting more open standards. IE, Firefox, Opera, Safari: four different JS and HTML "runtimes", not to mention the bazillion phones and other implementations.

Of course, once you start doing all of your client logic in JavaScript, it won't take long for folks to want to do the same thing on the server side. Why learn two languages when one is powerful and expressive enough for most everything you would want to do?

With the rise of client side JavaScript, I think we'll see a rise of server side JavaScript as well. It will be slower to arrive. It's hard to fight the tide of PHP and Java, but, at least in Java's case, JavaScript runs fine on Java. So, it's not difficult to start using JS for server side logic. Heck, the old Netscape web server offered server side JS 10 years ago; I don't know if the Sun Web Server maintains it any more or not (SWS is the NS heir). Running JS via CGI with SpiderMonkey is trivial right now as well, but I doubt you'll find many $5/month hosts with a handy SpiderMonkey install.
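
If you're curious what "JS on Java" looks like in practice, here's a minimal sketch using Mozilla's Rhino engine (it assumes Rhino's js.jar is on the classpath; the script and variable names are just for illustration):

    import org.mozilla.javascript.Context;
    import org.mozilla.javascript.Scriptable;
    import org.mozilla.javascript.ScriptableObject;

    public class ServerSideJs {
        public static void main(String[] args) {
            // Enter a Rhino context for this thread.
            Context cx = Context.enter();
            try {
                // Standard top-level objects: Object, String, Math, etc.
                Scriptable scope = cx.initStandardObjects();

                // Expose a Java value to the script as a global variable.
                ScriptableObject.putProperty(scope, "userName",
                        Context.javaToJS("Sally", scope));

                // The "server side logic", written in JavaScript.
                String script =
                    "var greeting = 'Hello, ' + userName + '!'; " +
                    "greeting.toUpperCase();";

                Object result = cx.evaluateString(scope, script, "<inline>", 1, null);
                System.out.println(Context.toString(result)); // HELLO, SALLY!
            } finally {
                Context.exit();
            }
        }
    }

That's the whole trick: the JVM hosts the interpreter, and the interesting logic lives in JavaScript.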

So, no, not quite prime time yet...but soon. Soon.

Of course, maybe not. Perhaps it will end up being relegated to a runtime language for the GWTs and Objective-Js of the world.

The biggest impact will be to the web designers. Static web pages won't be going away any time soon, but more and more web designers are going to have to become programmers. They won't like it. Web Design != Programming, different mindsets, different skill sets.

Anyway, get your party hat on and get ready to welcome back the "Fat App". Oh, and perhaps polish up on your JavaScript if you think you'd like to play in this arena. Yes, you too, server side folks.

Thursday, June 5, 2008

Java the next GAE language? That's probably half right.

Chris Herron pointed me to the article where Michael Podrazik suggests that Java will be the next language for the Google App Engine runtime.

I think he's half right.

By that I mean that if there's any Java in the next GAE language, it will be JavaScript.

Why is that?

It's pretty clear that Google has a lot of experience in house with JavaScript. The GWT runtime is entirely in JavaScript. They have their own XSLT processor in JavaScript (for browsers that don't support XSLT natively). Also, they have their Rhino on Rails project, which is a Ruby on Rails port to JavaScript.

Next, JavaScript fits nicely into the existing GAE infrastructure. It can be run just like Python is now. Also, there are several OSS JavaScript interpreters available to be used, of varying quality. Mozilla's new Tamarin runtime (the ActionScript VM contributed by Adobe) is one; the recently announced SquirrelFish runtime from WebKit could also be used.

The GAE API would fit well into a JavaScript world, with less of the "square peg, round hole" work that using Java would entail.

JavaScript, with its push to JavaScript 2.0, is rapidly growing up. It's always been an elegant language with its prototype inheritance scheme (some would argue it's a blight on the planet, but that's more a paradigm complaint, I think). The 2.0 changes will make it a bit more mainstream, make it faster, and make it even more powerful. So JavaScript is powerful today, but getting even more so. The tooling surrounding it is getting better as well.

Finally, there are a bazillion web developers who are becoming, whether they like it or not, conversational in JavaScript. Before, there was a clean separation between the client side and server side developers. Client side did HTML and CSS, while server side did scripting and logic.

But with the modern browsers having powerful JavaScript engines, and UI demands requiring fancier client side scripting for effects etc., not to mention Ajax, the client side developer has had the world of scripting and programming logic thrust upon them.

Some take to it well and become adept at leveraging JavaScript and its powers. Others simply cut and paste their way to success using the multitude of examples on the web. Either way, whether novice or expert, the client side developer is learning the fundamentals of programming and the nuances of the runtime through JavaScript.

If the client side developer were able to leverage that JavaScript knowledge on the server side, that would empower them even more.

JavaScript has had a mixed history on the server side. Netscape's server has supported server side JavaScript since forever, but obviously when someone thinks about the server, JavaScript is far from their mind. It has almost no mindshare.

Yet, we have, for example, the Phobos project which is a JavaScript back end, as well as the previously mentioned Rhino on Rails internal Google project. These are recent, though, without a lot of public history.

Now, to be fair, these are both Java systems operating as the host for a JavaScript based system. But there's no reason they have to be Java. The major browsers certainly don't use a Java runtime for their JavaScript systems, they use a C/C++ implementation.

With a C/C++ implementation, Google could readily launch a JavaScript runtime for their GAE that would fit quite well with their current infrastructure. Also, since there's very little momentum on the JavaScript server side, there's no real competition. No "why can't it operate like Project X". This gives Google even more freedom in terms of shaping the runtime the way they think it should be done.

So, I think that if there is any Java in GAE's future, in the near term it will be in name only, with JavaScript.

You heard it here first.

Monday, April 14, 2008

EC2 Persistent Storage -- It's a crutch.

You must understand, I live locked in a small box, and I'm allowed only this keyboard, a 9" B&W monitor, and a mailbox that only a single person knows about. Thankfully, that person is Chris Herron, and he keeps sending me interesting things.

Specifically, a recent blog post about Amazon's new Persistent Storage service for EC2.

What it does is make high bandwidth, high granularity, permanent storage available to EC2 nodes.

One of the characteristics of EC2 is that your instance lives on a normal, everyday Intel machine with CPU, memory, and hard drive. (Actually, this is most likely a VM running someplace, not a physical machine instance, but you never know.) But the model of the service is that while all of those capabilities are available to you, and the hard drive is indeed simply a hard drive, the machine it's all contained in can up and vanish at any time.

One minute everything is running along hunky dory, and the next your machine is dead.

Now most folks who do most simple things rarely lose an entire machine. They might lose access to the machine, like the network going down. They might lose power to the machine. Or an actual component on the machine may up and fail. But in all of these cases, for most scenarios, when the problem is resolved, the machine comes back to life effectively in the state it was in before the event. A long reboot, perhaps. Barring loss of the actual hard drive, these failures, while clearly affecting the availability of the machine (you can't use it when you're having these problems), don't really affect the integrity of the machine. A hard drive failure is typically the worst simple failure a machine can have, and simply adding a duplicate drive and mirroring it (which every modern OS can do today) can help protect from that.

The difference with EC2 is that should anything occur to your "machine" in EC2, you lose the hard drive. "Anything" means literally that, anything. And there are a lot more "anythings" within EC2 than a classic hosting environment. Specifically since they don't promise your machine will be up for any particular length of time, you can pretty much be assured that it won't be. And truth is, it doesn't matter what that length of time is, whether it's one day, one week, one month or one year. Whenever the machine goes down, you effectively "lose all your work".

Anyone who has worked in a word processor for several hours only to have the machine restart on you can share in the joy of what it's like to lose "unsaved work". And that is what any data written to the hard drive of an EC2 instance is -- unsaved work. Work that has the lifespan of the machine. Please raise your hand if you'd like a year's worth of sales, order history, and customer posted reviews to vanish in a heartbeat. Anyone? No, of course not.

Amazon's original solution was S3, their Simple Storage Service. This is a very coarse service, basically working at the level of whole files: you can only replace an entire file rather than update a section of it. You only have simple, streaming read and write functions.
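
To give a feel for just how coarse that is, here's a rough sketch of writing and reading an S3 object over plain HTTP (the bucket and key are made up, and the required request signing is omitted entirely):

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class S3Sketch {

        // Replace the whole object. There is no "update bytes 100-200" call.
        static void putObject(String bucket, String key, byte[] body) throws Exception {
            URL url = new URL("https://" + bucket + ".s3.amazonaws.com/" + key);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            OutputStream out = conn.getOutputStream();
            out.write(body);                      // the entire object, every time
            out.close();
            System.out.println("PUT status: " + conn.getResponseCode());
        }

        // Stream the whole object back down.
        static void getObject(String bucket, String key, OutputStream sink) throws Exception {
            URL url = new URL("https://" + bucket + ".s3.amazonaws.com/" + key);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            InputStream in = conn.getInputStream();
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                sink.write(buffer, 0, read);
            }
            in.close();
        }
    }

The unit of work is always the entire object, which is fine for images and backups, but useless for "update this one row".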

Next came SimpleDB, which Amazon offers as the next level of granularity. This allows small collections of attributes to be accessed individually. You can query, add, and delete the collections. Much better than S3, but it has its own issues. Specifically, its "eventual consistency" model. I bet most folks don't enjoy this characteristic of SimpleDB.

The new Persistent Storage service is what everyone has been looking for. Now they can go back to their old model of how computer systems are supposed to work, and they can host a RDBMS just like before. There was nothing stopping folks from running an RDBMS on any of the EC2 instances before, save that nagging "unsaved work" detail. Lose the instance, lose the RDBMS.

I can see why folks have been clamoring for this service, but frankly, I see it as a step backward from the basic tenet that Amazon allows folks to readily build scalable applications.

As noted earlier, today the most common bottleneck in most scalable systems IS the RDBMS. RDBMSs do not easily scale like most other parts of an application. It's the reality of the distributed application problem. And Amazon's approach to addressing the problem with SimpleDB is, I think, admirable.

It's not the solution people want, however. They WANT a scalable RDBMS, and SimpleDB simply is not that beast. But scalable RDBMSs are very, very difficult. All of these kinds of systems have shortcomings that folks need to work around, and an RDBMS is no different. Amazingly, the shortcomings of a distributed RDBMS are much like what SimpleDB offers in terms of "eventual consistency", except the RDBMS will struggle to hide that synchronizing process, like Google's Datastore does.

In the end, SimpleDB is designed the way it is, and is NOT an RDBMS, for a reason: to remain performant and scalable while working on a massively parallel infrastructure. I am fully confident that you cannot "Slashdot" SimpleDB. This is going to be one difficult beast to take down. However, the price of that resiliency is its simplicity and its "eventual consistency" model.

There's a purity of thought when your tools limit you. One of the greatest strengths, and greatest curses, in the Java world is the apparently unlimited number of web frameworks and such available to developers. As a developer, when you run into some shortcoming or issue with one framework, it's easy and tempting, especially early on, to let the eye wander and try some other framework to find that magic bullet that will solve your problem and work the way you work.

But it can also be very distracting. If you're not careful, you find you spend all your time evaluating frameworks rather than actually getting Real Work done.

Now, if you were, say, an ASP.NET programmer, you wouldn't be wandering the streets looking for another solution. You'd simply make ASP.NET work. For various reasons, there are NOT a lot of web frameworks in common use in the .NET world. There, ASP.NET is the hammer of choice, so as a developer you use it to solve your problems.

Similarly, if SimpleDB were your only persistence choice, with all of its issues, then you as a developer would figure out clever ways to overcome its limitations and develop your applications, simply because you really had no alternative.

With the new attached Persistent Storage, folks will take a new look at SimpleDB. Some will still embrace it. Some will realize that, truly, it is the only data solution they should be using if they want a scalable solution. But others will go "you know, maybe I don't need that much scalability". SimpleDB is such a pain to work with compared to the warm, snuggly familiarity of an RDBMS that folks will up and abandon SimpleDB.

With that assertion, they'll be back to running their standard application designs for their limited domains. To be fair, the big benefit of the new storage is that generic, everyday applications can be ported to the EC2 infrastructure with mostly no changes.

The downside is that these applications won't scale, as they won't embrace the architecture elements that Amazon offers to enable their applications to scale.

I think the RDBMS option that folks are going to be giddy about is a false hope.

As I mentioned before, RDBMSs are typically, and most easily, scaled in a vertical fashion. In most any data center, the DB machine is the largest and most powerful. If any machine in the datacenter has multiple, redundant networks, power supplies, hard drives, CPUs, etc., if any single machine is designed to survive a common component failure and continue running, it's the DB machine.

For example, ten web machines talking to a single DB machine. Any one of those web machines can fail and the system "works". Kill the DB machine, and the rest become room heaters.

The crux here, though, is that all of those machines in the EC2 cloud are basically like those web machines. Cheap, plentiful, and unreliable. The beauty of having easy access to lots of machines is the ability to lose them and still run. However, if you plan on running a DB on one of these, just be aware that it's just as fragile and unreliable as the rest. This machine WON'T have all the redundant features and such. It can come and go as quickly as the rest.

But being THE DB machine means that's not what you want. Lose a web server, eh, big deal. Lose the DB, BIG DEAL, the site is down and the phones are ringing.

Also, you're limited to the vertical scaling ability of their instances. Amazon offers larger nodes for hosting, with more CPUs, etc. You can vertically scale to a point with Amazon. But once you hit that boundary, you're stuck.

In Amazon land, if you want long term, reliable, and performant granular data storage, then you should be looking at SimpleDB, not the Persistent Storage. I think that if you want to build a scalable system on Amazon, you should work around the SimpleDB offering. The best use for the Persistent Storage and an RDBMS would be for things like a decision support database, something that is more powerful when it works with off the shelf tools (like ODBC) and that doesn't have the scaling needs of the normal application.

So, don't let the Persistent Storage offering fool you and distract you from the real problems if you want a scalable system.

Friday, April 11, 2008

GAE - Java?

At TheServerSide, there is an article about "Java is losing the battle for the modern web." It's chock full of bits and particles, at least indirectly, about issues with hosting web applications.

That's what GAE does. It hosts web applications. Currently, it hosts Python applications, but it can most likely host most anything, once they get a solid API for talking to the Datastore written for the environment.

Well, I should qualify that.

It can host most anything that works well with the idiom of "start process, handle request, stop process". As I mentioned before, GAE is running the CGI model of web application. This is counter to how Java applications run, however.

Java promotes executing virtual, deployable modules within a long running server process, typically a Java Servlet container. Most containers have mechanisms to support the loading, unloading, and reloading of these deployable modules. You can readily support several different web applications within a single Java Servlet container. As a platform, it's actually quite nice.

Java itself also promotes this kind of system. Java relies on long running processes to improve performance. Specifically, Java relies on "Just In Time" (JIT) compilers to translate JVM bytecode into native code. The magic of the JIT compiler is that it can observe the behavior of some Java code, and dynamically compile only the parts it feels are worthy of being compiled.

For example, say you have a class that has some non-trivial initialization code that runs when an instance is constructed, but also has some compute-intensive methods. If you only create one instance of that class and execute the compute-intensive methods, the JIT will convert those methods into native code, but will most likely not convert the logic in the constructor. It only runs once and isn't worth the expense of compiling in order to make it run faster.
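
Here's a contrived sketch of the idea (the class and the numbers are made up; exactly what gets compiled is ultimately up to the particular JVM):

    public class Checksum {

        private final int[] table = new int[64 * 1024];

        // Non-trivial initialization: runs exactly once per instance, so the
        // JIT has little reason to ever compile it to native code.
        public Checksum() {
            for (int i = 0; i < table.length; i++) {
                table[i] = Integer.rotateLeft(i * 31, i % 13);
            }
        }

        // Compute-intensive method: called millions of times below, so a
        // typical JIT will identify it as "hot" and compile it.
        public int mix(int value) {
            return table[value & (table.length - 1)] ^ (value >>> 7);
        }

        public static void main(String[] args) {
            Checksum c = new Checksum();   // constructor runs once, likely stays interpreted
            int acc = 0;
            for (int i = 0; i < 50000000; i++) {
                acc ^= c.mix(i);           // hot path, likely compiled to native code
            }
            System.out.println(acc);
        }
    }

On HotSpot, running this with -XX:+PrintCompilation should show mix() getting compiled, while the constructor generally never appears in the log.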

So, Java is the kind of system that gets faster the longer it runs. As the JIT observes things and gets some idle time, it will incrementally make the system faster via its compiler. Over time, more and more of the system is converted into native code. And, it so happens, very good native code.

The issue, however, is that in the case of something like GAE, a system such as Java is almost at complete odds with the environment that GAE promotes. GAE wants short processes, and lots of them, rather than large, single, long running processes.

So is Java completely out of the picture for something like GAE?

Actually, no I don't think so.

Just because Java LIKES long running processes, doesn't mean it's unusable without them. For example, the bulk of the Java community uses the Ant tool to build their projects. That's a perfect example of a commonly used, short term process in Java. Even javac is a Java program.

Java is perfectly usable in a short term scenario. Java could readily be used for CGI programs. What CAN'T (at least now) be used for CGI programs are typical Java web applications. They're just not written for the CGI environment. They rely on long running processes: in memory sessions, cached configuration information, Singleton variables. You most certainly wouldn't do something silly like launch Tomcat to process a single server request. That's just plain insane.

As a rule, Java tends to be "more expensive" to start than something like Perl or PHP. The primary reason is that most of Java is written in Java. Specifically, the class library. So, in order to Do Anything, you need to load the JVM, and then start loading classes. Java loads over 280 classes just to print "Hello World" (mind, these 280 classes come from only 3 files). All of that loading has some measure of overhead. I can well imagine that the code path between process start and "Hello World" is longer in Java than in, say, Perl. That code path is startup time.
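
You can eyeball that cost yourself: the stock -verbose:class flag prints every class the JVM loads on the way to that single println.

    // HelloWorld.java
    //
    // Compile and run with class loading traced:
    //   javac HelloWorld.java
    //   java -verbose:class HelloWorld
    // Every "[Loaded ...]" line is a class the JVM had to find and initialize
    // before (and while) printing one string; piping through "wc -l" gives a
    // rough count.
    public class HelloWorld {
        public static void main(String[] args) {
            System.out.println("Hello World");
        }
    }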

Of course, in modern web applications, startup time is almost irrelevant. Why? Because almost everyone embeds the scripting language into the web server. That's what mod_perl and mod_php do. They make the actual language interpreter and runtime first-class citizens of the web server process. This is in contrast to starting a new process, loading the interpreter, loading and executing your code, and then quitting. Apache will pay the interpreter startup cost just once, when Apache starts. There may be some connection oriented initialization when Apache forks a new process to handle a connection, but those connection processes are long lived as well.

So, it turns out, when you're running your language interpreter within the web server, startup time is pretty much factored out of the equation. Unless it's unrealistically long, startup time is a non-issue with embedded runtimes.

Which brings up the question "Where is mod_java?" Why not embed Java? And that's a good question. I know it's been discussed in the past, but I don't know if there's a reasonable implementation of embedding the JVM within an Apache process.

What does mod_java need to do? The best case scenario would be for an embedded JVM to start up with the host Apache process. The JVM would then load in client classes on request, execute them, and return the result. The last thing the JVM would do is toss aside all of the code it just loaded. It would do that via a special ClassLoader responsible for loading everything outside of the core JVM configuration. This helps the JVM stay reasonably tidy from run to run.
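
Conceptually, the per-request dispatch might look something like this (a hypothetical sketch, not a real mod_java API; it assumes the application's entry point implements some interface visible to the parent loader, Runnable here for simplicity):

    import java.net.URL;
    import java.net.URLClassLoader;

    public class RequestDispatcher {

        public void handle(URL[] webAppClasspath, String handlerClassName) throws Exception {
            // Parent loader holds the core JVM and container classes; the child
            // holds only the application code for this one request.
            URLClassLoader requestLoader =
                    new URLClassLoader(webAppClasspath, getClass().getClassLoader());

            Class<?> handlerClass = requestLoader.loadClass(handlerClassName);
            Runnable handler = (Runnable) handlerClass.newInstance();
            handler.run();   // execute the request and produce the result

            // Toss the loader aside. With no live references left, the classes
            // it loaded are eligible for unloading, and the JVM stays tidy
            // from request to request.
            requestLoader = null;
        }
    }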

The cool thing about this is that the code this JVM runs could readily use the Servlet API as its interface. The Servlet API has the concept of initializing Servlets, executing requests, and unloading Servlets. It also has the concept of persistent sessions that last from request to request. Obviously, most containers are long running, so those lifecycle methods are rarely invoked. Also, most folks consider sessions to be "in memory". Applications would need to adapt their behavior to assume that these lifecycle methods are called all the time, and that sessions are going to be written to and read from persistent storage every single request. So, you'd want web applications that have fast servlet initialization times and that store little session data.
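
A servlet written for that world might look like the sketch below. Nothing here is mod_java specific; it's just the standard Servlet API used with the assumption that init() is cheap and the session is tiny:

    import java.io.IOException;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import javax.servlet.http.HttpSession;

    public class CgiFriendlyServlet extends HttpServlet {

        @Override
        public void init() throws ServletException {
            // May run on every single request, so no framework bootstrapping
            // and no big caches. Keep startup fast.
        }

        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            HttpSession session = req.getSession(true);

            // Assume the session is serialized to persistent storage after
            // every request, so store only small values in it.
            Integer hits = (Integer) session.getAttribute("hits");
            hits = (hits == null) ? 1 : hits + 1;
            session.setAttribute("hits", hits);

            resp.setContentType("text/plain");
            resp.getWriter().println("Request number " + hits + " for this session");
        }
    }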

But those applications can still live under the purview of the standard Java Servlet API.

That means that you could have mod_java, and CGI style web apps, with JSPs and everything.

Most of the standard web frameworks would be out the window, since most have long startup and configuration times. But if the idiom becomes popular, no doubt some CGI friendly frameworks would pop up, changing the dynamic of that one time configuration, perhaps by loading it more lazily.

But would this kind of system perform?

Sure it would. In fact, it would likely perform better (in terms of pure application or script execution) than things like Perl or PHP. Why? Because Perl and PHP have to load and parse the script text every single request. Java just has to load bytecodes. Python has a similar pre-parsed form that can speed loading as well.

In this way, it turns out you can run Java safely, within a process model, you can run it quickly (though most likely not as quick as a long running process), you get to use all of the bazillion lines of off the shelf library code, and even still use the fundamental Servlet API as well.

Things like Hibernate, EJB, and most of the web frameworks would not apply however, so it will be a different model of Java development. But it IS Java, and all of the advantages therein.

And if you want to instead run JRuby, Jython, JavaScript, Groovy, or any other Java based scripting language, knock yourself out. In that case, it would be best to have mod_java perform some preload of those systems when Apache starts up, so they can be better ready to support scripting requests at request time with little spin up.

You would also have to limit the CGI Java processes from running things like threads, I would think. The goal is for the JVM to remain pristine after each request.

Google could readily incorporate such a "mod_java" into the GAE and make Java available to users. They can do this without having to reengineer the JVM.

There is one JVM change that would make this mod_java that much better, and that's the capability for the JVM to both grow dynamically in memory, and also free memory up back to the OS. I know JRockit can dynamically grow, I do not know if it can dynamically shrink.

If the JVM could do that, then there's no reason for the "cheap hosts" to not provide this style of Java capability on their servers, as hosting Java becomes little different than hosting PHP.

And wouldn't that be exciting?

Thursday, April 10, 2008

Contrasting SimpleDB and GAE Datastore

Part and parcel of the infrastructures that Amazon and Google are promoting are their internal persistence systems.

Let's talk scaling for just a sec here. There are two basic ways that applications can be scaled. Horizontal scaling, and Vertical scaling.

Horizontal scaling is spreading the application across several machines and using various load balancing techniques to spread application traffic across the different machines. If horizontal scaling is appropriate for your application, then if you want to support twice as much load, you can add twice as many machines.

Vertical scaling is using a bigger box to process the load. Here, you have only one instance of the application running, but it's running on a box with more CPUs, more memory, more whatever was limiting you before. Today, a simple example would be moving from a single CPU machine to a dual CPU machine. Ideally the dual CPU machine will double your performance. (It won't, for a lot of reasons, but it can be close.)

Websites, especially static websites, are particularly well suited to horizontal deployments. If you've ever downloaded anything off the web where they either asked you to select a "mirror", or even automatically select one for you, you can see this process in action. You don't care WHICH machine you hit as long as it has what you're looking for. Mirroring of information isn't "transparent" to the user, but it's still a useful technique. There are other techniques that can make such mirroring or load balancing transparent to the user (for example, we all know that there is not a single machine servicing "www.google.com", but it all looks the same to us as consumers).

Vertical scaling tends to work well with conventional databases. In fact, vertical scaling works well for any application that relies upon locally stored information. Of course, in essence, that's all that a database is. But databases offer a capability that most applications rely upon, and that's a consistent view of the data that the database contains. Most applications enjoy the fact that if you change the value of a piece of data, when you read that data back it will have the changed value. And, as important, other applications that view that data will see the changed data as well. It's a handy feature to have. And with a single machine hosting the database, it's easy to achieve. But that consistency can really hamper scaling of the database, as databases are limited by machine size.

Let's look at a contrived example. Say you have a single database instance, and two applications talking to it. It seems pretty straightforward that when App A makes a change in the DB, App B would see it as well. Same data, same machine, etc. Simple. But you can also imagine that as you start adding more and more applications talking to that database instance, eventually it's simply going to run out of capacity to service them all. There will simply not be enough CPU cycles to meet the requests.

You can see that if the applications are web applications, as you horizontally scale the web instance, you add pressure to your database instance.

That's not really a bad thing; there are a lot of large machines that run large databases. But those large machines are expensive. You can buy a 1U machine with a CPU in it for less than $1000. You can buy 25 such machines for less than $25000. But you can't buy a single machine with 25 CPUs for $25000. They're a lot more. If you want to run on cheap hardware, then you need to go horizontal.

So, why not add more database instances?

Aye, there's the rub. Let's add another database instance. App A talks to DB A, and App B talks to DB B. A user hits App A and changes their name, and App A sends the update to DB A. But now that user's data doesn't match on DB B; it has the old data (stale data, as it were). How does DB B get synchronized with DB A? And, as important, WHEN does it get synchronized? And what if, instead of just two instances, you have 25 instances?

THAT is the $64 question. It turns out it's a Hard Problem. Big brainy types have been noodling this problem for a long time.

So, for many applications, the database tends to be the focal point of the scalability problem. Designers and engineers have worked out all sorts of mechanisms to get around the problem of keeping disparate sets of information synchronized.

Now, Amazon and Google are renting out their infrastructure with the goal of providing "instant" scalability. They've solved the horizontal scaling problem, they have a bazillion machines, Amazon will let you deploy to as many as you want, while Google hides that problem from you completely.

But how do they handle the data problem? How do they "fix" that bottleneck? Just because someone can quickly give you a hundred machines doesn't necessarily make solving the scalability issue easier. There's a bunch of hosts out there that will deploy a hundred servers for you.

Google and Amazon, however, offer their own data services to help take on this problem, and they're both unconventional for those who have been working with the ubiquitous Relational Database Systems of the past 30 years.

Both are similar in that they're flexible in their structure, and have custom query languages (i.e. not SQL).

Google's datastore is exposed to the Python programmer by tightly integrating the persistence layer with the Python object model. It's also feature rich in terms of offering different data types, allowing rows to have relationships to each other, etc. Google limits how you can query the data with predefined indexes. You as the developer can define your indexes however you want, but you will be limited to querying your data via those indexes. There's no real "ad hoc" query capability supported by the datastore. Also, the Google datastore is transactional in that you can send several changes to the datastore at once, and they will either all occur "at once", or none of them will occur.

Amazon's SimpleDB is more crude. Each database entry is a bag of multivalued attributes, all of which need to be string data. You as a developer are burdened with converting, say, numbers from string data into internal forms for processing, then converting them back into string values for storing. Also, Amazon doesn't allow any relationships among its data. Any relationships you want to make must be done in the application. Finally, SimpleDB is not a transactional system. There seems to be a promise that once the system accepts your change, it will commit the change, but you can't make several changes over time and consider them as a whole.
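
A small example of the kind of chore this pushes on you: since queries compare strings lexicographically, numbers are commonly zero-padded on the way in and parsed on the way out (this is just a sketch of the pattern, not any official API):

    public class SimpleDbEncoding {

        private static final int WIDTH = 10;

        // 4999 -> "0000004999", so that string comparison matches numeric order
        // ("0000004999" < "0000010000", whereas "4999" > "10000" as plain strings).
        // Assumes non-negative values; negatives need an offset scheme.
        public static String encodePrice(long cents) {
            return String.format("%0" + WIDTH + "d", cents);
        }

        // "0000004999" -> 4999, back into a form the application can do math on.
        public static long decodePrice(String attributeValue) {
            return Long.parseLong(attributeValue);
        }

        public static void main(String[] args) {
            String stored = encodePrice(4999);
            System.out.println(stored);              // what goes into the attribute
            System.out.println(decodePrice(stored)); // what the application works with
        }
    }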

Finally, there's one other crucial advertised difference between Amazon's and Google's systems. SimpleDB is designed to scale, and exposes that design to the developer. Google's is also, but it offers a different promise to the user.

See, Google appears to be promising consistency across the database. That's all well and good, but as you load down the database, that maintenance has costs. SimpleDB, on the other hand, and interestingly enough, does NOT guarantee consistency. Well, at least not immediately.

For example, you read data from the database, say that user record with the user name in it. You update the data with the new name, and write it back to the database. If you then immediately read it back, you may well get the OLD record with the OLD name. In the example above, you just updated DB A, and read back the data from DB B.

Amazon guarantees that "eventually", your data will be consistent. Most likely in a few seconds.
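
In code, coping with that window looks something like the sketch below (the ItemStore interface is a stand-in for illustration, not the actual SimpleDB API):

    // Hypothetical minimal view of an eventually consistent store.
    interface ItemStore {
        void put(String itemName, String attribute, String value);
        String get(String itemName, String attribute);
    }

    public class EventualConsistencyDemo {

        // Write, then poll until the new value becomes visible (or we give up).
        static boolean waitForWrite(ItemStore store, String item, String attr,
                                    String newValue, long timeoutMillis)
                throws InterruptedException {
            store.put(item, attr, newValue);
            long deadline = System.currentTimeMillis() + timeoutMillis;
            while (System.currentTimeMillis() < deadline) {
                if (newValue.equals(store.get(item, attr))) {
                    return true;           // the write has propagated
                }
                Thread.sleep(250);         // still seeing the old value; try again
            }
            return false;                  // "eventually" hasn't happened yet
        }
    }

Most applications won't literally poll like this; the point is that the read-after-write gap is now something your design has to acknowledge.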

Now, Google doesn't stipulate that limitation. The API says "update your data and the transaction commits or it doesn't". That implies when you write the data, it's going to be there when you read it back, that your new data will immediately be available.

Now, by punting on the integrity and consistency guarantee, Amazon is pushing some of the complexity of managing a distributed application back onto the developer.

In truth this is not such a bad thing. By exposing this capability, this limitation, you are forced as a developer to understand the ramifications of having "data in flight", so to speak, knowing that when you look at the datastore, it may be just a wee bit out of date. This capability will definitely turn your application design sideways.

In return, though, you will have a scalable system, and know how it scales. Building applications around unreliable data on unreliable machines is what distributed computing is all about. That's why it's SO HARD. Two of the great fallacies of network computing are that the network is cheap and reliable, when in fact it's neither. Yet many application developers consider the network safe, because many idioms make the pain of the network transparent to them, giving the illusion of safety.

Amazon's SimpleDB doesn't. They basically guarantee "If you give us some data, we'll keep it safe and eventually you can get it back". That's it. If that "eventually" number is lower than the time between queries, then all looks good. But being aware that there IS a window of potential inconsistency is a key factor in application design.

Now, Google hides this side effect of the database implementation from you. But it does impose another limitation, which is basically that your transaction must take less than 5 seconds or it will be rolled back. To be fair, both systems have time limits on database actions, but what is key to the Google promise is that they can use that time window in order to synchronize the distributed data store. The dark side of the 5 second guarantee is not that your request will fail after 5 seconds, but that EVERY request can take UP TO 5 seconds to complete.

SimpleDB could have made a similar promise, that each commit may take 5 seconds, and used that 5 second window to synchronize your data store, but at the price of an expensive DB request. Instead, they return "immediately", with the assurance that at some point the data will be consistent; meanwhile you can mosey on to other things. What's nice about this is that if it takes more than 5 seconds for the data to become consistent, you as a developer are not punished for it. With Google, your request is rejected if it takes too long. With Amazon, it, well, just takes too long. Whether it takes .1 seconds to get consistent or 1 minute, you as a developer have to deal with the potential discrepancy during application design.

Have you ever posted to Slashdot? At the end, after your post, it says "your post will not appear immediately". That's effectively the same promise that Amazon is making.

What it all boils down to is that the SimpleDB request is an asynchronous request (fire and forget, like sending an email), while the Google request is synchronous (click and wait, like loading a web page). They both do the same thing, but by exposing the call details, Amazon gives the developer a bit more flexibility along with more responsibility.

But here's the golden egg that's under this goose. Both solutions give you better options as a developer for persisting data than an off the shelf Relational Database, at least in terms of getting the application to scale. Recall that the database tends to be the bottleneck, that lone teller in the crowded bank who everyone in line is cursing.

For large, scaled systems, both of these systems handle a very hard problem and wrap it up in a nice, teeny API that will fit on an index card, and they give it to you on the cheap.

Wednesday, April 9, 2008

Application Deployment -- the new Models

As many have heard, Google has announced a new hosting service. They call it the Google App Engine.

Over on TheServerSide, they're comparing GAE and Amazon's EC2.

Meanwhile, recently on JavaLo^H^H^H^H^H^HDZone, there was lamenting about cheap Java hosting.

EC2 is very interesting. They managed to effectively out-grid Sun's Grid, at least for most people. EC2 makes commodity computing even more of a commodity. It rapidly eliminates more of the scaling barriers folks have. Sun's system is more geared to higher performance, short term, high CPU jobs, whereas EC2 is being used for actual hosting.

For $72/month, you end up with a pretty reliable, persistent system with EC2. You need to structure your application to leverage the Amazon infrastructure, but the payoff is high.

Beyond the infrastructure details, Amazon offers a pretty standard development model. The infrastructure does force some design constraints (specifically, the machine you're running on can be yanked out from underneath you at a moment's notice, so don't store anything Important on it), but once you tweak your dynamic persistence model, EC2 offers up a straightforward deployment environment.

But GAE is different. GAE has turned the model upside down and has, in fact, rewound the clock back to web development circa 1996.

Essentially, GAE promotes the Old School CGI model for web development. Specifically, it's embracing the classic old Unix Process model for web development. This is contrary to much of the work today on threaded server architectures, notably all of the work done in Apache 2, as well as the stock Java model of application development and deployment as well as the new "Comet" application architecture.

See, threads live in a shared environment and are lighter weight in terms of CPU processing for switching between tasks. Simply put, a server can support more independent running threads than it can support independent running processes, and the time to switch between them is less. That means a threaded system can support more concurrent tasks, and will have a faster response time.

But, the cost of threads is that you lose some safety. The threads all share a parent process. If one thread manages to somehow corrupt the parent process, then not only is the individual thread impacted, but so are all of the other threads.

You can see this in a Java server by having a single thread allocate too much memory, or consume too much CPU where the only resolution is to restart the Java process. If that server were supporting several hundred other connections, all of those are reset and interrupted.
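
Here's a toy illustration of that shared fate. One misbehaving "request handler" exhausts the heap that every other thread in the process depends on (run it with a small heap, say java -Xmx32m SharedFateDemo):

    import java.util.ArrayList;
    import java.util.List;

    public class SharedFateDemo {

        // Held in a static field so the heap stays pinned even after the
        // offending thread dies of OutOfMemoryError.
        private static final List<byte[]> leak = new ArrayList<byte[]>();

        public static void main(String[] args) {
            // A well-behaved worker, standing in for hundreds of healthy connections.
            Thread healthy = new Thread(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            byte[] scratch = new byte[64 * 1024]; // modest per-request allocation
                            System.out.println("healthy worker handled a request, scratch=" + scratch.length);
                            Thread.sleep(500);
                        }
                    } catch (InterruptedException e) {
                        // shutting down
                    }
                }
            });
            healthy.start();

            // One bad request handler that leaks memory until allocations fail.
            // With the heap pinned, the healthy worker's own small allocations
            // will most likely start failing too; the whole process suffers.
            while (true) {
                leak.add(new byte[1024 * 1024]);
            }
        }
    }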

With a process model, each process can be readily limited by the host operating system. They can have their resources easily curtailed (amount of memory, how much disk they use, total CPU time, etc.). When a process violates its contract with the OS, the OS kills it without a second thought. The benefit there is that your overall server is more stable, since the death of a process rarely affects other running processes.

But if threading is more efficient, why would Google go back to the process model? Can't threads be made safe?

Sure, threads can be made safe. If processes can be made safe, threads can be made safe. It's just a lot more work.

However, here is the crux of the matter. Threads make more efficient use of the host CPU. That's a given. But what if CPU scarcity is not a problem? What if conserving CPU resources is no longer a driver? What if overall response time, system stability and scalability are more important than CPU efficiency?

For Google, CPUs are "free". Granted, I imagine that Google spends more on powering and hosting CPUs than it does on payroll (I don't know, that's just a guess), but wrangling thousands of individual CPUs is a routine task at Google, and they have power to spare.

Using that model, here's how the GAE system turns the application hosting space on its head.

First, unlike Amazon, they're offering you an environment with a bit of disk space and some bandwidth. Your application can't serve up anything other than its own files, or anything else available over HTTP. Your application cannot change the space it's deployed in (they have the Datastore for that). Your application has hard runtime parameters placed upon it.

Also, and most importantly, your application has no promise as to WHERE it is being run. You have NO IDEA what machine any individual request will be executed upon. Every other hosting model out there is selling A Machine. GAE is selling you some CPU, somewhere, anywhere, and some bandwidth.

All of your actual applications are running within hard processes on some server somewhere, yet all of your data is served up from dedicated, and different, static data servers. This lets Google leverage the threading model for things like static resources (where it's very good), but use the process model for your applications (which is very safe).

What can Google do with this infrastructure? Simply put, it can scale indefinitely. Imagine a huge array of CPUs, all sharing a SAN hosting your application. Any of those CPUs can run your application. If you get one web hit, no problem. If you get Slashdotted, again, no problem. Since you're not on a single machine, your application WILL scale with demand. One machine, or a hundred machines, makes no difference.

As always, the hardest hit part will be the data store. But I have to imagine that Google has "solved" this problem as well, to a point. Their Datastore has to be as distributed as it can be. We'll have to wait and see more about how this works out.

Where does Java fit in this? Well, it doesn't. Unless you want to run CGI apps. Java web apps are basically long running processes, and those simply don't exist on the Google infrastructure.

Exciting times. I fully expect to see startups, OSS frameworks, and other papers written up on this infrastructure so it can be duplicated. There's no reason folks cannot clone the GAE (or even Amazon) network APIs and host this kind of infrastructure in-house. Long term, there will be no "vendor lock-in", I don't think.

Welcome to my Blog

Some folks have said "Hey Will, you should have a blog." For some reason, I seem to have listened to them and this is the result.

No presumptions, no promises, we'll see where this thing goes.

Follow along and join in if you're so inclined.