EventSource in a Windows Azure WebRole

Instrumenting an ASP.NET Windows Azure WebRole using ETW (Event Tracing for Windows) with the .NET 4.5 EventSource and Windows Azure Diagnostics

Nico Vuyge

2013-02-22

Introduction

The new EventSource class is the single most important feature I've been missing in the .NET framework for over a decade. With it's release in .NET 4.5, I used it to replace the existing logging functionality in my web site with the most performant tracing technology available in Windows.

History, starting over a decade ago with the PDC 2001

I got introduced to ETW at the PDC 2001. In their session 'SVR325:Diagnosing And Debugging Live Systems', Iain Frew andIan Service presented the ETW Tracing feature introduced in Windows 2000 (unfortunately, no webcast is to be found of this session, this was from the era before the age of live conference broadcasts. I still have their powerpoint slides though.). This tracing facility was originally intended to be used to instrument device drivers, but became available to regular developers in Windows 2000. With roots in device drivers, you can imagine that this technology has very strict performance requirements. I was particulary impressed with their performance measurements. Quoting from their slides, they claimed:

"1500 machine cycles to fire and log a simple event"
"Logs 8,000 messages/second @ <2% of="" cpu="" overhead="" on="" a="" dual="" pii-300="" li="">

Even in 2001, a dual PII 300Mhz was already fairly old technology. You can imagine that the overhead on todays multigigahertz CPU's is far lower than that.

The presenters claimed '.NET Classes in progress'. .NET 1.0 was released a couple of months later, but no support for ETW event tracing was to be seen. Some support for ETW was introduced in .NET 3.5 at the end of 2007, but it's only in .NET 4.5 at the end of 2012 that we finally see support that really matches the ease of use of .NET.

Previous experience with ETW tracing

Back in 2001 I was working on a COM based C++/ATL library, intended for consumption by ASP clients (ASP classic!). As our development progressed, we felt the need for a tracing mechanism, and based on the claims above decided to implement ETW. That turned out to be a good decision as we were very happy with the performance and feature set of ETW tracing. But it was fairly complicated to implement as it relied on authoring metadata in MOF files, and writing variable length messages in buffers corresponding exactly to that metadata. As Log4Net became more commonly used, we ended up with a mixed C#/managed C++/native C++ code base that allowed us to choose whether to use log4net or ETW tracing.

At that time, I decided to do some performance tests comparing log4net and ETW. I don't remember the exact numbers, but ETW tracing had much less overhead than log4net. With all tracing on the performance differences were pretty dramatic. That's when I knew if I ever had the chance, I would use ETW tracing again. I haven't done the same comparison with the EventSource implementation, but based on that experience I expect it to have much less overhead than Log4Net.

Reasons for using ETW tracing

Compared to traditional logging mechanism like Log4Net, ETW tracing has the following advantages:

Traces are not confined to an application. A multitude of traces sources (on the same machine) can be combined to a single trace. This makes it much easier to view correlations between different things happening on the same machine. Since Windows itself has extensive ETW tracing support (mainly introduced with Windows Vista), it allows us to combine our own tracing with OS traces or traces from other ETW instrumented applications. Even .Net is fully ETW instrumented.
Traces are not just strings, they are structured data. This means that they can be easily consumed by automated processes, and don't require the overhead of logging the same contextual textual info again and again.
Performance. I guess I already mentioned this.

Implementing EventSource.

Implementing ETW tracing with the EventSource class is very simple. I actually can't image how it can be made even more simple. It's at least a order of magnitude more simple to implement than using C++, and even somewhat simpler than using Log4Net. Basically, you only have to implement functions like the following code snippet:

        [Event(16, Level = EventLevel.Informational, Message = "Permanent redirect from {0} to {1}", Keywords = Keywords.Redirects)]
        public void PermanentRedirect(string sourceUrl, string destinationUrl)
        {
            WriteEvent(16, sourceUrl, destinationUrl);
        }

        [NonEvent]
        public void PermanentRedirect(Uri sourceUrl, Uri destinationUrl)
        {
            if (IsEnabled(EventLevel.Informational, Keywords.Redirects))
            {
                PermanentRedirect(sourceUrl.ToString(), destinationUrl.ToString());
            }
        }

Basically you write a function for which each parameter is a basic type corresponding to the variable data that you want to log e.g. public void PermanentRedirect(string sourceUrl, string destinationUrl). All required ETW metadata is automatically inferred from the .Net metadata (e.g. the field names sourceUrl and destinationUrl), or written explicitly as .NET attributes ([Event(16, Level = EventLevel.Informational, Message = "Permanent redirect from {0} to {1}", Keywords = Keywords.Redirects)]).

For performance reasons, it is best to write a helper variant that accepts higher level classes as arguments (e.g. public void PermanentRedirect(Uri sourceUrl, Uri destinationUrl)), that only does the necessary conversion to basic type arguments when there is actually a trace running by checking the IsEnabled function. Generating a trace event is just a matter of calling this method with the correct parameters.

More resources.

I can highly recommend Vance Morrison's blog as the primary source of information on ETW tracing and the EventSource implementation. In particular, his post about the EventSource specification is an extremely interesting overview. Also, EntLib now has a CTP of the Semantic Logging Application Block, which is intended to move you from unstructured textual logging to structured ETW based logging.

The Windows Azure part of the equation.

ETW tracing consists of 3 distinct actors: producers, consumers and controllers. The above code only implements the producer part. The consumers are standard Windows applications that know how process ETL files or live ETW data like PerfView (again, see Vance Morrison's blog).

What's missing here is the controller part. The controller is responsible for enabling traces, controlling trace level and flags and such, and controlling where to log to. You could log to a circular memory buffer as a 'black box' for your application, or to an ETL file for more permanent logging.

I didn't want to complicate my Windows Azure WebRole deployment by including one of the standard ETW controllers. Instead, I opted to use Vance Morrison's open source trace controller implementation. This class library enables turning on the ETW tracing to an ETL file in just a few lines of code. However, it turned out that I got very big ETL files pretty fast. Since I move those ETL files via Azure Diagnostics to Azure Storage for later analysis, it's important that these files are only as large as they need to be. I hadn't experienced abnormally large ETL files in my C++ days a decade ago, so I decided to investigate what made these files so much bigger than expected. Looking at the binary view of those ETL files, I saw that most of the content was just 0xFF filler bytes. It turns out that the size is impacted by:

The default buffer size. Having a large buffer size will make you initial ETL file size bigger. If there isn't enough tracing activity going on to fill up the buffers, the files will be larger than necessary. This can be improved by lowering the buffer size to something lower than the default 32MB via the property TraceEventSession.BufferSizeMB.
The flushing behavior. The TraceEvent code sets EVENT_TRACE_PROPERTIES.FlushTimer to 10 to flush every 10 seconds. Setting that to 0 (no flushing) solved the issue of continuously growing log files.

In the end, I felt the sizes were still too large for my usage scenario, so I compressed them with the new ZipFile support in .NET 4.5