Friday, April 11, 2008

GAE - Java?

At TheServerSide, there is an article about "Java is losing the battle for the modern web." It's chock full of bits and particles, at least indirectly, about issues with hosting web applications.

That's what GAE does. It hosts web applications. Currently, it hosts Python applications, but it can most likely host most anything, once they get a solid API for talking to the Datastore written for the environment.

Well, I should qualify that.

It can host most anything that works well with the idiom of "start process, handle request, stop process". As I mentioned before, GAE is running the CGI model of web application. This is counter to how Java applications run, however.

Java promotes executing virtual, deployable modules within a long running server process, typically a Java Servlet container. Most containers have mechanisms to support the loading, unloading, and reloading of these deployable modules. You can readily support several different web applications within a single Java Servlet container. As a platform, it's actually quite nice.

Java itself also promotes this kind of system. Java relies on long running processes to improve performance. Specifically, Java relies on "Just In Time" (JIT) compilers to translate JVM bytecode into native code. The magic of the JIT compiler is that it can observe the behavior of some Java code, and dynamically compile only the parts it judges worth compiling.

For example, say you have a class that has some non-trivial initialization code that runs when an instance is constructed, but it also has some compute-intensive methods. If you only create one instance of that class, and execute the compute-intensive methods many times, the JIT will convert those methods into native code, but will most likely not convert the logic in the constructor. It only runs once and isn't worth the expense of converting in order to make it run faster.
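To make that concrete, here is a minimal sketch of such a class. The class and method names are made up for illustration, and which methods actually get compiled depends on the JVM and its thresholds, so treat the comments as the likely outcome rather than a guarantee.

```java
// Illustrative only: JitCandidate and its methods are hypothetical names.
final class JitCandidate {
    private final int[] table;

    // Non-trivial setup that runs once per instance. With a single instance,
    // this code never gets "hot", so a JIT will most likely leave it alone.
    JitCandidate(int size) {
        table = new int[size];
        for (int i = 0; i < size; i++) {
            table[i] = (i * 31) ^ (i >>> 3);
        }
    }

    // Compute-intensive method that is called over and over. This is the kind
    // of code a JIT typically picks for translation into native code.
    long checksum() {
        long sum = 0;
        for (int value : table) {
            sum = sum * 17 + value;
        }
        return sum;
    }

    public static void main(String[] args) {
        JitCandidate candidate = new JitCandidate(1_000); // constructor runs once
        long result = 0;
        for (int i = 0; i < 100_000; i++) {               // hot loop
            result = candidate.checksum();
        }
        System.out.println(result);
    }
}
```

On a HotSpot JVM, running this with -XX:+PrintCompilation should show which methods the compiler actually chose.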

So, Java is a kind of system that gets faster the longer it runs. As the JIT observes things, and gets some idle time, it incrementally makes the system faster via its compiler. Over time, more and more of the system is converted into native code. And, it so happens, very good native code.

The issue, however, is that in the case of something like GAE, a system such as Java is almost at complete odds with the environment that GAE promotes. GAE wants short processes, and lots of them, rather than large, single, long running processes.

So is Java completely out of the picture for something like GAE?

Actually, no, I don't think so.

Just because Java LIKES long running processes, doesn't mean it's unusable without them. For example, the bulk of the Java community uses the Ant tool to build their projects. That's a perfect example of a commonly used, short term process in Java. Even javac is a Java program.

Java is perfectly usable in a short term scenario. Java could readily be used for CGI programs. What CAN'T (at least now) be used for CGI programs are typical Java web applications. They're just not written for the CGI environment. They rely on long running processes: in memory sessions, cached configuration information, Singleton variables. You most certainly wouldn't do something silly like launch Tomcat to process a single server request. That's just plain insane.

As a rule, Java tends to be "more expensive" to start than something like Perl or PHP. The primary reason is that most of Java is written in Java. Specifically the class library. So, in order to Do Anything, you need to load the JVM, and then start loading classes. Java loads over 280 classes just to print "Hello World" (mind you, these 280 classes come from only 3 files). All of that loading has some measure of overhead. I well imagine that the code path between process start and "Hello World" is longer in Java than in, say, Perl. That code path is startup time.
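If you want to see this on your own machine, here is a rough sketch that uses the standard java.lang.management beans to report how many classes are loaded and how long the JVM has been up. The class name StartupCost is made up, and the numbers will vary by JVM version. Note too that asking the question loads a few management classes, which inflates the count somewhat.

```java
import java.lang.management.ManagementFactory;

// Illustrative only: StartupCost is a hypothetical name.
final class StartupCost {

    // Number of classes currently loaded in this JVM.
    static int loadedClassCount() {
        return ManagementFactory.getClassLoadingMXBean().getLoadedClassCount();
    }

    // Milliseconds elapsed since the JVM started.
    static long uptimeMillis() {
        return ManagementFactory.getRuntimeMXBean().getUptime();
    }

    public static void main(String[] args) {
        System.out.println("Hello World");
        System.out.println("classes loaded: " + loadedClassCount());
        System.out.println("ms since JVM start: " + uptimeMillis());
    }
}
```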

Of course, in modern web applications, startup time is almost irrelevant. Why? Because almost everyone embeds the scripting language into the web server. That's what mod_perl and mod_php do. They make the actual language interpreter and runtime first-class citizens of the web server process. This is in contrast to starting a new process, loading the interpreter, loading and executing your code, and then quitting. Apache will pay the interpreter startup cost just once, when Apache starts. There may be some connection oriented initialization when Apache forks a new process to handle a connection, but those connection processes are long lived as well.

So, it turns out, when you're running your language interpreter within the web server, startup time is pretty much factored out of the equation. Unless it's unrealistically long, startup time is a non-issue with embedded runtimes.

Which raises the question "Where is mod_java?" Why not embed Java? And that's a good question. I know it's been discussed in the past, but I don't know if there's a reasonable implementation of embedding the JVM within an Apache process.

What does mod_java need to do? The best case scenario would be for an embedded JVM to start up with the host Apache process. The JVM would then load in client classes on request, execute them, and return the result. The last thing the JVM would do is toss aside all of the code it just loaded. It would do that via a special ClassLoader responsible for loading everything outside of the core JVM configuration. This helps the JVM stay reasonably tidy from run to run.
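Here is a minimal sketch of that load, execute, discard cycle using a standard URLClassLoader. Everything named here (RequestScopedRunner, HelloHandler, handleRequest) is hypothetical, and using Callable as the handler contract is just an assumption for the example. A real mod_java would also need the embedding glue on the Apache side, which this ignores entirely.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.concurrent.Callable;

// Illustrative only: all names in this class are hypothetical.
final class RequestScopedRunner {

    // Stand-in for client code. In a real deployment this class would live
    // on appClasspath rather than next to the runner.
    public static final class HelloHandler implements Callable<String> {
        @Override
        public String call() {
            return "Hello from a request-scoped class";
        }
    }

    // Loads the handler through a fresh loader, runs it, then closes the loader.
    // Classes defined by that loader become unreachable (and collectable) once
    // nothing refers to them, which keeps the JVM tidy from run to run.
    static String handleRequest(URL[] appClasspath, String handlerClassName) {
        ClassLoader core = RequestScopedRunner.class.getClassLoader();
        try (URLClassLoader loader = new URLClassLoader(appClasspath, core)) {
            Class<?> handlerClass = Class.forName(handlerClassName, true, loader);
            Object handler = handlerClass.getDeclaredConstructor().newInstance();
            return String.valueOf(((Callable<?>) handler).call());
        } catch (Exception e) {
            throw new IllegalStateException("request failed", e);
        }
    }
}
```

One caveat: the loader delegates to its parent first, so only classes found on appClasspath are actually thrown away. Anything the core loader already knows about, including HelloHandler in this toy setup, stays loaded.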

The cool thing about this, is that the code that this JVM runs could readily use the Servlet API as its interface. The Servlet API has the concept of initializing Servlets, executing requests, and unloading Servlets. It also has the concept of persistent sessions that last from request to request. Obviously, most containers are long running, so those lifecycle methods are rarely invoked. Also, most folks consider sessions to be "in memory". Applications would need to adapt their behavior to assume that these lifecycle methods are called all the time, and that your sessions are going to be written to and read from persistent storage every single request. So, you'd want web applications that have fast servlet initialization times and that store little session data.
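As a rough sketch of what "lifecycle methods called all the time" might look like, here is a toy container that initializes a handler, reads the session from storage, services the request, writes the session back, and tears the handler down, all on every request. MiniServlet is a made-up stand-in for the Servlet lifecycle, not the real javax.servlet API, and the in-memory map of byte arrays is an assumption standing in for persistent storage.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: all names in this class are hypothetical.
final class CgiStyleContainer {

    // Hypothetical stand-in for a servlet's init/service/destroy lifecycle.
    interface MiniServlet {
        void init();
        String service(Map<String, Object> session);
        void destroy();
    }

    // Cheap to initialize and keeps very little session data.
    static final class HitCounter implements MiniServlet {
        @Override public void init() { }
        @Override public String service(Map<String, Object> session) {
            int hits = (Integer) session.getOrDefault("hits", 0) + 1;
            session.put("hits", hits);
            return "hits=" + hits;
        }
        @Override public void destroy() { }
    }

    // Stand-in for persistent storage: session id -> serialized session.
    private final Map<String, byte[]> sessionStore = new HashMap<>();

    // The full lifecycle runs for every single request.
    String handle(String sessionId) {
        MiniServlet servlet = new HitCounter();
        servlet.init();
        try {
            HashMap<String, Object> session = load(sessionId);
            String response = servlet.service(session);
            save(sessionId, session);
            return response;
        } finally {
            servlet.destroy();
        }
    }

    private HashMap<String, Object> load(String sessionId) {
        byte[] bytes = sessionStore.get(sessionId);
        if (bytes == null) {
            return new HashMap<>();
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            @SuppressWarnings("unchecked")
            HashMap<String, Object> session = (HashMap<String, Object>) in.readObject();
            return session;
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("could not read session", e);
        }
    }

    private void save(String sessionId, HashMap<String, Object> session) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(session);
        } catch (IOException e) {
            throw new IllegalStateException("could not write session", e);
        }
        sessionStore.put(sessionId, buffer.toByteArray());
    }
}
```

Calling handle("a") twice on the same container returns "hits=1" and then "hits=2", even though no servlet instance or session object survives between the two calls. The serialization round trip on every request is exactly why small sessions matter in this model.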

But those applications can still live under the purview of the standard Java Servlet API.

That means that you could have mod_java, and CGI style web apps, with JSPs and everything.

Most of the standard web frameworks would be out the window, since most have long startup and configuration times. But if the idiom becomes popular, no doubt some CGI friendly frameworks would pop up, changing the dynamic of that one time configuration, perhaps by loading things lazily.

But would this kind of system perform?

Sure it would. In fact, it would likely perform better (in terms of pure application or script execution) than things like Perl or PHP. Why? Because Perl and PHP have to load and parse the script text every single request. Java just has to load bytecodes. Python has a similar pre-parsed form that can speed loading as well.

In this way, it turns out you can run Java safely within a process model like this, and you can run it quickly (though most likely not as quickly as in a long running process). You get to use all of the bazillion lines of off the shelf library code, and you can still use the fundamental Servlet API as well.

Things like Hibernate, EJB, and most of the web frameworks would not apply however, so it will be a different model of Java development. But it IS Java, and all of the advantages therein.

And if you want to instead run JRuby, Jython, JavaScript, Groovy, or any other Java based scripting language, knock yourself out. In that case, it would be best to have mod_java preload those systems when Apache starts up, so they are ready to serve scripting requests with little spin up.

You would also have to limit the CGI Java processes from running things like threads, I would think. The goal is for the JVM to remain pristine after each request.

Google could readily incorporate such a "mod_java" into the GAE and make Java available to users. They can do this without having to reengineer the JVM.

There is one JVM change that would make this mod_java that much better, and that's the capability for the JVM to both grow dynamically in memory, and also free memory up back to the OS. I know JRockit can dynamically grow, I do not know if it can dynamically shrink.

If the JVM could do that, then there's no reason for the "cheap hosts" to not provide this style of Java capability on their servers, as hosting Java becomes little different than hosting PHP.

And wouldn't that be exciting?
