Friday, April 9, 2010

AppDomains and true isolation

As you may well know, .NET application domains (AppDomains for short) are "lightweight processes", at least according to the authors of several books on the .NET Framework (FW) that I've read and to a great many developers I've spoken to. For the most part I agree with the metaphor, but there is one thing processes provide that AppDomains do not: fault tolerance. The operating system doesn't crash when one of its processes crashes. But as of .NET FW 2.0, an unhandled exception in any AppDomain brings down the whole process that hosts it. I tend to think in terms of corner cases and worst-case scenarios (my analyst complains about that a lot), so for me this basically means that AppDomains hardly provide any isolation at all.

Let's look at the ways of fixing that. One way would be to write a custom CLR host that overrides the standard policy for unhandled exceptions. I will not cover this option, as I'm not an expert at customizing CLR hosts. Another way is to revert to the .NET FW 1.0/1.1 policy for unhandled exceptions AND (this is very important) to emulate the .NET FW 2.0 policy for the main AppDomain of the application, while unloading any other domain that encounters an unhandled exception. This is the way I'm going to show and explain.

First, let's prove that the problem exists. The following code crashes the process for good (see for yourself by compiling it as a console application). Run it without the debugger attached, so that you don't get stuck on the line where the exception is thrown:
using System;
using System.Threading;

class Program
{
    private static AppDomain _subDomain;

    static void Main(string[] args)
    {
        _subDomain = AppDomain.CreateDomain("CodeRunningDomain");

        try
        {
            _subDomain.DoCallBack(delegate()
            {
                new Thread(delegate()
                {
                    throw new InvalidOperationException();
                }).Start();
            });
        }
        catch
        {
            // this block is for demonstration purposes only,
            // generally it's not a good idea to catch System.Exception.
            // this block is never hit, of course
        }

        Thread.Sleep(TimeSpan.FromDays(2)); // the process is torn down faster than in 2 days, you will see.
    }
}

Not much to explain here: under the now-standard .NET policy, an unhandled exception tears down the process. I agree that this is a great default, because an error that went unhandled may have left state corrupted. But if you're building a host, say one that runs plugins in their own AppDomains or executes a piece of untrusted code in its own AppDomain, you don't want the host to die because a plugin has errored out, or because the code you didn't trust proved your assumptions about it right.

The first step in working around the issue is to put the crappy .NET FW 1.0/1.1 unhandled exception policy back in business. Under this policy, as you may know, an exception that goes unhandled on a secondary thread (like the one started above) does not tear the process down. To achieve this, paste the following into a newly created application configuration file for the console application you've probably created to run the code above:
<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <runtime>
    <legacyUnhandledExceptionPolicy enabled="true"/>
  </runtime>
</configuration>

Now everything runs without crashing, and the program will block for two days if you let it. But that's not what we want: the .NET FW 2.0 exception policy happens to be cool, and we want it back, but only for the main AppDomain. Here is what we do: we hook up a handler for unhandled exceptions in every AppDomain. When an exception propagates up to the main AppDomain (it does propagate), we look at which domain it originated from. If it did not come from the main AppDomain, we unload the offending AppDomain and carry on. Otherwise we tear the process down, which is close to (though not exactly) what the CLR does in these cases. Of course, we're working under the assumption that a failure in one AppDomain does not (and mustn't) affect the other domains. Keep in mind that this bold assumption is not always true. I don't want to bore you to death, so here is the code. (Again: run it without the debugger attached, so that you don't get stuck on the line where the exception is thrown.)

using System;
using System.IO;
using System.Threading;

namespace ProperAppDomainIsolation
{
    class Program
    {
        private static AppDomain _subDomain;

        static void Main(string[] args)
        {
            AppDomain.CurrentDomain.UnhandledException +=
              new UnhandledExceptionEventHandler(MainDomain_UnhandledException);

            _subDomain = AppDomain.CreateDomain("CodeRunningDomain");

            _subDomain.UnhandledException +=
              new UnhandledExceptionEventHandler(Subdomain_UnhandledException);

            try
            {
                _subDomain.DoCallBack(delegate()
                {
                    new Thread(delegate()
                        {
                            throw new InvalidOperationException();
                        }).Start();
                });
            }
            catch (Exception)
            {
                // log, throw, do something.
            }

            Thread.Sleep(TimeSpan.FromDays(2));
        }

        static void MainDomain_UnhandledException(object sender, UnhandledExceptionEventArgs e)
        {
            // runs in the main AppDomain
            Exception exception = e.ExceptionObject as Exception;

            if (exception == null)
            {
                Environment.FailFast("Very descriptive message");
            }

            Console.WriteLine("Got an exception in main application domain.");

            try
            {
                Marker exceptionMarker = ExceptionMarker.GetExceptionMarker(exception);

                Console.WriteLine("Retrieved marker: " + exceptionMarker.Value);
                Console.WriteLine("Offending AppDomain: " + _subDomain.Id);

                if (exceptionMarker.Value.Equals(_subDomain.Id.ToString(), StringComparison.Ordinal))
                {
                    new Thread(delegate()
                        {
                            try
                            {
                                Console.WriteLine("Unloading the offending application domain...");

                                AppDomain.Unload(_subDomain);
                                _subDomain = null;

                                Console.WriteLine("Unloaded the offending application domain...");
                            }
                            catch (Exception ex)
                            {
                                Console.WriteLine(ex.ToString());
                            }
                        }).Start();
                }
            }
            catch (InvalidOperationException)
            {
                // The exception is not marked, so it originates from the main application domain.
                // Demonstration code: a more specific exception type should be used for this signal.
                Environment.FailFast("Terminating due to an unhandled exception in the main application domain.");
            }
        }

        static void Subdomain_UnhandledException(object sender, UnhandledExceptionEventArgs e)
        {
            // runs in the subdomain
            Exception exception = e.ExceptionObject as Exception;

            if (exception == null)
            {
                Environment.FailFast("Very descriptive message");
            }

            ExceptionMarker.MarkException(exception, new Marker(AppDomain.CurrentDomain.Id.ToString()));
        }
    }
}

On my box the output of the program is as follows:

Got an exception in main application domain.
Retrieved marker: 2
Offending AppDomain: 2
Unloading the offending application domain...
Unloaded the offending application domain...

Now we've got ourselves a host! The only fun part here is distinguishing between the exceptions that came from the main AppDomain and those that came from any other domain.

(ExceptionMarker adds a special marker to the exception's property bag in the subdomain and then reads the marker in the main domain; if no marker is attached to the exception, it originated in the main AppDomain, and the process is doomed. Full source code is here.)
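
For readers who don't want to follow the link, here is a minimal sketch of what Marker and ExceptionMarker could look like. It is a reconstruction based on how the two classes are used in the code above, not the original source. I am assuming that the "property bag" is Exception.Data, the key name is made up, and the marker is stored as a plain string so that it travels with the exception when the exception is marshaled into the main AppDomain.

```csharp
using System;

// Hypothetical reconstruction: the member names match the calls made in the
// listing above, everything else is an assumption.
[Serializable]
public sealed class Marker
{
    private readonly string _value;

    public Marker(string value)
    {
        if (value == null) throw new ArgumentNullException("value");
        _value = value;
    }

    public string Value
    {
        get { return _value; }
    }
}

public static class ExceptionMarker
{
    // Assumed key name; any string unlikely to clash with other entries will do.
    private const string MarkerKey = "ProperAppDomainIsolation.Marker";

    // Called in the subdomain: stores the marker value in the exception's Data dictionary.
    public static void MarkException(Exception exception, Marker marker)
    {
        if (exception == null) throw new ArgumentNullException("exception");
        if (marker == null) throw new ArgumentNullException("marker");

        // Only the string is stored, so nothing but a string has to cross the domain boundary.
        exception.Data[MarkerKey] = marker.Value;
    }

    // Called in the main domain: reads the marker back, or throws
    // InvalidOperationException when the exception was never marked.
    public static Marker GetExceptionMarker(Exception exception)
    {
        if (exception == null) throw new ArgumentNullException("exception");

        string value = exception.Data[MarkerKey] as string;
        if (value == null)
        {
            throw new InvalidOperationException("The exception carries no marker.");
        }

        return new Marker(value);
    }
}
```

As the comment in the handler above already admits, signaling "no marker" with InvalidOperationException is a demonstration shortcut; a dedicated exception type or a TryGet-style method would be a cleaner choice in real code.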

P.S. Originally I thought that attaching a configuration file (like the one shown above) to the domains where the code runs would suffice, as long as I also took care to unload each domain when an exception occurred inside it. That didn't work: legacyUnhandledExceptionPolicy seems to affect the main application's AppDomain only.