Microrebooting paper’s reasoning holes

CIO Today noticed the microrebooting research paper. While I agree that the paper is very interesting, I think there are some holes in it that were not explored (or at least not explained).

Specifically:

  1. Early in the article, memory leaks and resource leaks are named as the problems solved by rebooting.

    From what I can see, microrebooting will only solve issues where the leaked resources/objects are held by instances that are cleared on microreboot. It will do nothing for pooled resources leaked because a close/release method was never called. It will also do nothing if the memory leak comes from creating too many classes at runtime (e.g. stubs) or if resources are held in the static parts of classes. This is because the classloader is explicitly not recreated.

  2. The rebooting sequence of the application server recommended in the paper is as follows:

    1. microreboot of the component transitive closure
    2. then kill -9 the server
    3. reboot the O/S.

    It seems that something is missing between 1) and 2). Where is the attempt to shut down the server via the normal shutdown command? That would allow the server to synchronise its buffers, finish writing out logs, etc. Going straight to kill -9 is really an emergency exit and is highly inadvisable. In the same vein, I am not sure how much good rebooting the O/S is going to do for a Java AppServer. Of course, a proper shutdown takes time, but is a faster reboot worth leaving your transactions in an in-doubt state due to a corrupted JTA in-flight store?

    • Threads. The paper proposes killing all threads associated with the resources. I would be very interested in how they propose to do that well, given that Thread.stop is deprecated beyond belief and Thread.interrupt() does nothing for threads stuck on synchronized-method deadlocks/bottlenecks.
    • Memory recovery via microreboot. The paper suggests microrebooting the components to free some space. A more effective approach, to my thinking, would be to have hooks into the component caches and to request that they drop all cached content down to the working set or to the start counts.

      So, a typical production system will start with 5 JDBC pool connections, but may go up to 100. Asking it to drop back down to 5 (or the working set, whichever is larger) will free up a lot of cached result sets. The same goes for cached Entity beans, etc.

    • Finally, the main requirement for all this to work is a component storing its state in external transacted storage. Have they calculated whether the cost of shifting to external storage, instead of faster in-memory/in-place state, is worth the time benefits of microrebooting in the long run? I am not so sure, especially with the larger HttpSession data sets I have observed in the real world.

      The exact question here is: how often does one need to microreboot instead of fully rebooting to recover the time lost on satisfying the additional constraints? My own sort-of answer here is that perhaps an extra half-second per response is an acceptable trade-off, but I would like to know what the researchers themselves think.
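
The classloader hole from point 1 can be sketched in a few lines of plain Java. The class and field names here are made up for illustration, but the mechanism is standard: static fields belong to the class, and the class lives as long as its classloader, so discarding all component instances does not release what the statics hold.

```java
import java.util.ArrayList;
import java.util.List;

public class StaticLeakDemo {
    // Static state lives with the class (i.e. with the classloader),
    // not with any particular component instance.
    static final List<byte[]> CACHE = new ArrayList<>();

    static class Component {
        void handleRequest() {
            CACHE.add(new byte[1024]); // "leaks" into static state
        }
    }

    public static void main(String[] args) {
        Component c = new Component();
        for (int i = 0; i < 100; i++) c.handleRequest();
        c = null;     // microreboot: all component instances are dropped...
        System.gc();
        // ...but the static cache is still fully populated, because the
        // classloader (and hence the class and its statics) was not recreated.
        System.out.println("entries still held: " + CACHE.size());
    }
}
```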
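
The thread objection is also easy to demonstrate with nothing but Java SE: a thread stuck trying to enter a synchronized monitor sits in the BLOCKED state and ignores interrupt() entirely, so there is no safe standard mechanism for a microreboot to reclaim it. A minimal sketch:

```java
public class InterruptVsMonitor {
    static final Object LOCK = new Object();

    // Holds LOCK, starts a worker that blocks trying to enter it,
    // interrupts the worker, and reports the worker's state afterwards.
    public static Thread.State stateAfterInterrupt() throws InterruptedException {
        synchronized (LOCK) {
            Thread worker = new Thread(() -> {
                synchronized (LOCK) { /* never entered while main holds LOCK */ }
            });
            worker.start();
            while (worker.getState() != Thread.State.BLOCKED) {
                Thread.sleep(5); // wait until the worker parks on the monitor
            }
            worker.interrupt(); // sets the interrupt flag, but cannot unblock it
            Thread.sleep(50);
            return worker.getState(); // still BLOCKED
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("state after interrupt: " + stateAfterInterrupt());
    }
}
```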
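
The cache-shrinking hook I have in mind could look something like the sketch below. ShrinkablePool and its methods are hypothetical, not any existing app-server API; the point is that the pool exposes an explicit shrink operation a memory-pressure handler can call, instead of microrebooting the whole component.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class ShrinkablePool<T> {
    private final Deque<T> idle = new ArrayDeque<>();
    private final int startCount;

    public ShrinkablePool(int startCount) { this.startCount = startCount; }

    public void release(T resource) { idle.push(resource); }
    public T acquire() { return idle.poll(); } // null -> caller creates a new one

    // Drop idle resources until only max(startCount, floor) remain;
    // returns how many were dropped.
    public int shrinkTo(int floor) {
        int target = Math.max(startCount, floor);
        int dropped = 0;
        while (idle.size() > target) { idle.pop(); dropped++; }
        return dropped;
    }

    public int idleCount() { return idle.size(); }

    public static void main(String[] args) {
        ShrinkablePool<Object> pool = new ShrinkablePool<>(5);
        for (int i = 0; i < 100; i++) pool.release(new Object()); // grew under load
        int dropped = pool.shrinkTo(0); // back down to the start count of 5
        System.out.println("dropped=" + dropped + " idle=" + pool.idleCount());
    }
}
```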

Still, all the nitpicking aside, Weblogic already provides some things that resonate with the idea, if not with the suggested implementation.

  • Component redeploy will try to remove all instances (and classes) at the EAR/EJB/WAR level and reload them. In fact, I am having trouble figuring out how the micro-reboot is better than this already-available functionality.
  • Node managers will monitor your system and restart the node if any of its subsystems goes into the ‘warning’ state.
  • JRockit allows you to monitor memory usage and fire code triggers when thresholds are crossed. You can do whatever you like at that point.
  • Weblogic 8.1sp3 monitors JDBC pool entries and can time out an inactive connection and consider it leaked. It also provides connection leak profiling, where a non-closed connection will scream when its finalize method is hit. It also provides resource retry for JDBC, similar in spirit to what the paper talked about.
  • JMX and SNMP also allow you to define all sorts of thresholds with notification triggers.
  • The thread subsystem will monitor request processing time and will log a message when a single request takes longer than a threshold value to process. In the next version of WLS, it will also print the stack trace of the stuck thread. Notice that it will not kill the thread.
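
For what it is worth, plain java.lang.management can approximate this kind of memory trigger without JRockit. The sketch below (MemoryTrigger and onThreshold are made-up names, not a JRockit or Weblogic API) checks heap usage against a threshold fraction and runs an arbitrary hook when it is crossed; a real setup would poll this from a timer, or use MemoryPoolMXBean usage-threshold notifications instead.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class MemoryTrigger {
    // Runs the hook if current heap usage has crossed the given fraction of
    // the maximum heap (an unknown maximum is treated as already crossed).
    public static void onThreshold(double fraction, Runnable hook) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        long used = mem.getHeapMemoryUsage().getUsed();
        long max = mem.getHeapMemoryUsage().getMax();
        if (max <= 0 || (double) used / max >= fraction) {
            hook.run(); // e.g. ask caches and pools to shrink to start counts
        }
    }

    public static void main(String[] args) {
        // A fraction of 0.0 always fires; in practice you would poll
        // periodically with something like 0.8.
        onThreshold(0.0, () -> System.out.println("threshold crossed"));
    }
}
```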

But I do welcome any research that makes the support job easier. Check out the other interesting papers at the Microrecovery and Microreboot center (WayBackMachine archive).

BlogicBlogger Over and Out